How should a software product handle an access violation

How should a software product handle an access violation - c++

We have a software product in c++, that due to documented problems with the compiler generates faulty code (Yes I know it is horrible in itself). This among other bugs causes Access Violations to be thrown.
Our response to that is to catch the error and continue running.
My question is, is this a responsible approach? Is it responsible to let an application live when it has failed so disasterously? Would it be more responsible to alert the user and die?
Edit:
One of the arguments of letting the exception unhandled is that Access Violation shows that the program was prevented from doing harm, and probably haven't done any either. I am not sure if I buy that. Are there any views on this?

I'm with Ignacio: It's imperative to get a fix for that compiler ASAP, or if such a fix is not forthcoming, to jump ship. Naturally there may be barriers to doing so, and I'm guessing you're looking for a short-term solution en route to achieving that goal. :-)
If the faulty code problem is not very narrowly constrained to a known, largely harmless situation, then I'd tend to think continuing to produce and ship the product with the faulty code could be considered irresponsible, regardless of how you handle the violation.
If it's a very narrowly constrained, known situation, how you handle it depends on the situation. You seem to know what the fault is, so you're in the position to know whether you can carry on in the face of that fault or not. I would tend to lean toward report and exit, but again, it totally depends on what the fault actually is.

Really it goes without saying, but it's irresponsible to act like the program did something it didn't (when it should have set some value somewhere which actually was a dangling pointer), or didn't do something it shouldn't have (when it randomizes some variable somewhere unfortunate enough to be the destination of a dangling pointer).
Damage minimization/mitigation strategies might be to checksum files (but not in a trivial way; actually verify that untouched data within the file is unmodified) and auto-save often.
Do you think the customer is aware of the problem?

Related

Why is a segmentation fault not recoverable?

Following a previous question of mine, most comments say "just don't, you are in a limbo state, you have to kill everything and start over". There is also a "safeish" workaround.
What I fail to understand is why a segmentation fault is inherently nonrecoverable.
The moment in which writing to protected memory is caught - otherwise, the SIGSEGV would not be sent.
If the moment of writing to protected memory can be caught, I don't see why - in theory - it can't be reverted, at some low level, and have the SIGSEGV converted to a standard software exception.
Please explain why after a segmentation fault the program is in an undetermined state, as very obviously, the fault is thrown before memory was actually changed (I am probably wrong and don't see why). Had it been thrown after, one could create a program that changes protected memory, one byte at a time, getting segmentation faults, and eventually reprogramming the kernel - a security risk that is not present, as we can see the world still stands.
When exactly does a segmentation fault happen (= when is SIGSEGV sent)?
Why is the process in an undefined behavior state after that point?
Why is it not recoverable?
Why does this solution avoid that unrecoverable state? Does it even?

When exactly does segmentation fault happen (=when is SIGSEGV sent)?
When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized but different OS might implement it differently. "Segmentation fault" is mainly a term used in *nix systems, Windows calls it "access violation".
Why is the process in undefined behavior state after that point?
Because one or several of the variables in the program didn’t behave as expected. Let’s say you have some array that is supposed to store a number of values, but you didn’t allocate enough room for all them. So only those you allocated room for get written correctly, and the rest written out of bounds of the array can hold any values. How exactly is the OS to know how critical those out of bounds values are for your application to function? It knows nothing of their purpose.
Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows for example are such segmentation faults prone to overwrite adjacent variables, unless the error was caught by protection mechanisms.
If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told - for example, overwriting unrelated variables and keep on going. Which in turn could cause disastrous behavior in case the application is mission-critical.
Why is it not recoverable?
Because the OS doesn’t know what your program is supposed to be doing.
Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.
Why does this solution avoid that unrecoverable state? Does it even?
That solution is just ignoring the error and keeps on going. It doesn’t fix the problem that caused it. It’s a very dirty patch and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.
We have to realize that a segmentation fault is a symptom of a bug, not the cause.

Please explain why after a segmentation fault the program is in an undetermined state
I think this is your fundamental misunderstanding -- the SEGV does not cause the undetermined state, it is a symptom of it. So the problem is (generally) that the program is in an illegal, unrecoverable state WELL BEFORE the SIGSEGV occurs, and recovering from the SIGSEGV won't change that.
When exactly does segmentation fault happen (=when is SIGSEGV sent)?
The only standard way in which a SIGSEGV occurs is with the call raise(SIGSEGV);. If this is the source of a SIGSEGV, then it is obviously recoverable by using longjump. But this is a trivial case that never happens in reality. There are platform-specific ways of doing things that might result in well-defined SEGVs (eg, using mprotect on a POSIX system), and these SEGVs might be recoverable (but will likely require platform specific recovery). However, the danger of undefined-behavior related SEGV generally means that the signal handler will very carefully check the (platform dependent) information that comes along with the signal to make sure it is something that is expected.
Why is the process in undefined behavior state after that point?
It was (generally) in undefined behavior state before that point; it just wasn't noticed. That's the big problem with Undefined Behavior in both C and C++ -- there's no specific behavior associated with it, so it might not be noticed right away.
Why does this solution avoid that unrecoverable state? Does it even?
It does not, it just goes back to some earlier point, but doesn't do anything to undo or even identify the undefined behavior that cause the problem.

A segfault happens when your program tries to dereference a bad pointer. (See below for a more technical version of that, and other things that can segfault.) At that point, your program has already tripped over a bug that led to the pointer being bad; the attempt to deref it is often not the actual bug.
Unless you intentionally do some things that can segfault, and intend to catch and handle those cases (see section below), you won't know what got messed up by a bug in your program (or a cosmic ray flipping a bit) before a bad access actually faulted. (And this generally requires writing in asm, or running code you JITed yourself, not C or C++.)
C and C++ don't define the behaviour of programs that cause segmentation faults, so compilers don't make machine-code that anticipates attempted recovery. Even in a hand-written asm program, it wouldn't make sense to try unless you expected some kinds of segfaults, there's no sane way to try to truly recover; at most you should just print an error message before exiting.
If you mmap some new memory at whatever address the access way trying to access, or mprotect it from read-only to read+write (in a SIGSEGV handler), that can let the faulting instruction execute, but that's very unlikely to let execution resume. Most read-only memory is read-only for a reason, and letting something write to it won't be helpful. And an attempt to read something through a pointer probably needed to get some specific data that's actually somewhere else (or to not be reading at all because there's nothing to read). So mapping a new page of zeros to that address will let execution continue, but not useful correct execution. Same for modifying the main thread's instruction pointer in a SIGSEGV handler, so it resumes after the faulting instruction. Then whatever load or store will just have not happened, using whatever garbage was previously in a register (for a load), or similar other results for CISC add reg, [mem] or whatever.
(The example you linked of catching SIGSEGV depends on the compiler generating machine code in the obvious way, and the setjump/longjump depends on knowing which code is going to segfault, and that it happened without first overwriting some valid memory, e.g. the stdout data structures that printf depends on, before getting to an unmapped page, like could happen with a loop or memcpy.)
Expected SIGSEGVs, for example a JIT sandbox
A JIT for a language like Java or Javascript (which don't have undefined behaviour) needs to handle null-pointer dereferences in a well-defined way, by (Java) throwing a NullPointerException in the guest machine.
Machine code implementing the logic of a Java program (created by a JIT compiler as part of a JVM) would need to check every reference at least once before using, in any case where it couldn't prove at JIT-compile time that it was non-null, if it wanted to avoid ever having the JITed code fault.
But that's expensive, so a JIT may eliminate some null-pointer checks by allowing faults to happen in the guest asm it generates, even though such a fault will first trap to the OS, and only then to the JVM's SIGSEGV handler.
If the JVM is careful in how it lays out the asm instructions its generating, so any possible null pointer deref will happen at the right time wrt. side-effects on other data and only on paths of execution where it should happen (see #supercat's answer for an example), then this is valid. The JVM will have to catch SIGSEGV and longjmp or whatever out of the signal handler, to code that delivers a NullPointerException to the guest.
But the crucial part here is that the JVM is assuming its own code is bug-free, so the only state that's potentially "corrupt" is the guest actual state, not the JVM's data about the guest. This means the JVM is able to process an exception happening in the guest without depending on data that's probably corrupt.
The guest itself probably can't do much, though, if it wasn't expecting a NullPointerException and thus doesn't specifically know how to repair the situation. It probably shouldn't do much more than print an error message and exit or restart itself. (Pretty much what a normal ahead-of-time-compiled C++ program is limited to.)
Of course the JVM needs to check the fault address of the SIGSEGV and find out exactly which guest code it was in, to know where to deliver the NullPointerException. (Which catch block, if any.) And if the fault address wasn't in JITed guest code at all, then the JVM is just like any other ahead-of-time-compiled C/C++ program that segfaulted, and shouldn't do much more than print an error message and exit. (Or raise(SIGABRT) to trigger a core dump.)
Being a JIT JVM doesn't make it any easier to recover from unexpected segfaults due to bugs in your own logic. The key thing is that there's a sandboxed guest which you're already making sure can't mess up the main program, and its faults aren't unexpected for the host JVM. (You can't allow "managed" code in the guest to have fully wild pointers that could be pointing anywhere, e.g. to guest code. But that's normally fine. But you can still have null pointers, using a representation that does in practice actually fault if hardware tries to deref it. That doesn't let it write or read the host's state.)
For more about this, see Why are segfaults called faults (and not aborts) if they are not recoverable? for an asm-level view of segfaults. And links to JIT techniques that let guest code page-fault instead of doing runtime checks:
Effective Null Pointer Check Elimination Utilizing Hardware Trap a research paper on this for Java, from three IBM scientists.
SableVM: 6.2.4 Hardware Support on Various Architectures about NULL pointer checks
A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is always positive, and that it can't be larger than 32 bit, you're all set.
Implicit Java Array Bounds Checking on 64-bit
Architectures. They talk about what to do when array size isn't a multiple of the page size, and other caveats.
Background: what are segfaults
The usual reason for the OS delivering SIGSEGV is after your process triggers a page fault that the OS finds is "invalid". (I.e. it's your fault, not the OS's problem, so it can't fix it by paging in data that was swapped out to disk (hard page fault) or copy-on-write or zero a new anonymous page on first access (soft page fault), and updating the hardware page tables for that virtual page to match what your process logically has mapped.).
The page-fault handler can't repair the situation because the user-space thread normally because user-space hasn't asked the OS for any memory to be mapped to that virtual address. If it did just try to resume user-space without doing anything to the page table, the same instruction would just fault again, so instead the kernel delivers a SIGSEGV. The default action for that signal is to kill the process, but if user-space has installed a signal handler it can catch it.
Other reasons include (on Linux) trying to run a privileged instruction in user-space (e.g. an x86 #GP "General Protection Fault" hardware exception), or on x86 Linux a misaligned 16-byte SSE load or store (again a #GP exception). This can happen with manually-vectorized code using _mm_load_si128 instead of loadu, or even as a result of auto-vectorization in a program with undefined behaviour: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? (Some other OSes, e.g. MacOS / Darwin, deliver SIGBUS for misaligned SSE.)
Segfaults usually only happen after your program encountered a bug
So your program state is already messed up, that's why there was for example a NULL pointer where you expected one to be non-NULL, or otherwise invalid. (e.g. some forms of use-after free, or a pointer overwritten with some bits that don't represent a valid pointer.)
If you're lucky it will segfault and fail early and noisily, as close as possible to the actual bug; if you're unlucky (e.g. corrupting malloc bookkeeping info) you won't actually segfault until long after the buggy code executed.

The thing you have to understand about segmentation faults is that they are not a problem. They are an example of the Lord's near-infinite mercy (according to an old professor I had in college). A segmentation fault is a sign that something is very wrong, and your program thought it was a good idea to access memory where there was no memory to be had. That access is not in itself the problem; the problem came at some indeterminate time before, when something went wrong, that eventually caused your program to think that this access was a good idea. Accessing non-existent memory is just a symptom at this point, but (and this is where the Lord's mercy comes into it) it's an easily-detected symptom. It could be much worse; it could be accessing memory where there is memory to be had, just, the wrong memory. The OS can't save you from that.
The OS has no way to figure out what caused your program to believe something so absurd, and the only thing it can do is shut things down, before it does something else insane in a way the OS can't detect so easily. Usually, most OSes also provide a core dump (a saved copy of the program's memory), which could in theory be used to figure out what the program thought it was doing. This isn't really straightforward for any non-trivial program, but that's why the OS does it, just in case.

While your question asks specifically about segmentation faults, the real question is:
If a software or hardware component is commanded to do something nonsensical or even impossible, what should it do? Do nothing at all? Guess what actually needs to be done and do that? Or use some mechanism (such as "throwing an exception") to halt the higher-level computation which issued the nonsensical command?
The vast weight of experience gathered by many engineers, over many years, agrees that the best answer is halting the overall computation, and producing diagnostic information which may help someone figure out what is wrong.
Aside from illegal access to protected or nonexistent memory, other examples of 'nonsensical commands' include telling a CPU to divide an integer by zero or to execute junk bytes which do not decode to any valid instruction. If a programming language with run-time type checking is used, trying to invoke any operation which is not defined for the data types involved is another example.
But why is it better to force a program which tries to divide by zero to crash? Nobody wants their programs to crash. Couldn't we define division-by-zero to equal some number, such as zero, or 73? And couldn't we create CPUs which would skip over invalid instructions without faulting? Maybe our CPUs could also return some special value, like -1, for any read from a protected or unmapped memory address. And they could just ignore writes to protected addresses. No more segfaults! Whee!
Certainly, all those things could be done, but it wouldn't really gain anything. Here's the point: While nobody wants their programs to crash, not crashing does not mean success. People write and run computer programs to do something, not just to "not crash". If a program is buggy enough to read or write random memory addresses or attempt to divide by zero, the chances are very low that it will do what you actually want, even if it is allowed to continue running. On the other hand, if the program is not halted when it attempts crazy things, it may end up doing something that you do not want, such as corrupting or destroying your data.
Historically, some programming languages have been designed to always "just do something" in response to nonsensical commands, rather than raising a fatal error. This was done in a misguided attempt to be more friendly to novice programmers, but it always ended badly. The same would be true of your suggestion that operating systems should never crash programs due to segfaults.

At the machine-code level, many platforms would allow programs that are "expecting" segmentation faults in certain circumstances to adjust the memory configuration and resume execution. This may be useful for implementing things like stack monitoring. If one needs to determine the maximum amount of stack that was ever used by an application, one could set the stack segment to allow access only to a small amount of stack, and then respond to segmentation faults by adjusting the bounds of the stack segment and resuming code execution.
At the C language level, however, supporting such semantics would greatly impede optimization. If one were to write something like:
void test(float *p, int *q)
{
float temp = *p;
if (*q += 1)
function2(temp);
}
a compiler might regard the read of *p and the read-modify-write sequence on *q as being unsequenced relative to each other, and generate code that only reads *p in cases where the initial value of *q wasn't -1. This wouldn't affect program behavior anything if p were valid, but if p was invalid this change could result in the segment fault from the access to *p occurring after *q was incremented even though the access that triggered the fault was performed before the increment.
For a language to efficiently and meaningfully support recoverable segment faults, it would have to document the range of permissible and non-permissible optimizations in much more detail than the C Standard has ever done, and I see no reason to expect future versions of the C Standard to include such detail.

It is recoverable, but it is usually a bad idea.
For example Microsoft C++ compiler has option to turn segfaults into exceptions.
You can see the Microsoft SEH documentation, but even they do not suggest using it.

Honestly if I could tell the computer to ignore a segmentation fault. I would not take this option.
Usually the segmentation fault occurs because you are dereferencing either a null pointer or a deallocated pointer. When dereferencing null the behavior is completely undefined. When referencing a deallocated pointer the data you are pulling either could be the old value, random junk or in the worst case values from another program. In either case I want the program to segfault and not continue and report junk calculations.

Segmentation faults were a constant thorn in my side for many years. I worked primarily on embedded platforms and since we were running on bare metal, there was no file system on which to record a core dump. The system just locked up and died, perhaps with a few parting characters out the serial port. One of the more enlightening moments from those years was when I realized that segmentation faults (and similar fatal errors) are a good thing. Experiencing one is not good, but having them in place as hard, unavoidable failure points is.
Faults like that aren't generated lightly. The hardware has already tried everything it can to recover, and the fault is the hardware's way of warning you that continuing is dangerous. So much, in fact, that bringing the whole process/system crashing down is actually safer than continuing. Even in systems with protected/virtual memory, continuing execution after this sort of fault can destabilize the rest of the system.
If the moment of writing to protected memory can be caught
There are more ways to get into a segfault than just writing to protected memory. You can also get there by e.g., reading from a pointer with an invalid value. That's either caused by previous memory corruption (the damage has already been done, so it's too late to recover) or by a lack of error checking code (should have been caught by your static analyzer and/or tests).
Why is it not recoverable?
You don't necessarily know what caused the problem or what the extent of it is, so you can't know how to recover from it. If your memory has been corrupted, you can't trust anything. The cases where this would be recoverable are cases where you could have detected the problem ahead of time, so using an exception isn't the right way to solve the problem.
Note that some of these types of problems are recoverable in other languages like C#. Those languages typically have an extra runtime layer that's checking pointer addresses ahead of time and throwing exceptions before the hardware generates a fault. You don't have any of that with low-level languages like C, though.
Why does this solution avoid that unrecoverable state? Does it even?
That technique "works", but only in contrived, simplistic use cases. Continuing to execute is not the same as recovering. The system in question is still in the faulted state with unknown memory corruption, you're just choosing to continue blazing onward instead of heeding the hardware's advice to take the problem seriously. There's no telling what your program would do at that point. A program that continues to execute after potential memory corruption would be an early Christmas gift for an attacker.
Even if there wasn't any memory corruption, that solution breaks in many different common use cases. You can't enter a second protected block of code (such as inside a helper function) while already inside of one. Any segfault that happens outside a protected block of code will result in a jump to an unpredictable point in your code. That means every line of code needs to be in a protective block and your code will be obnoxious to follow. You can't call external library code, since that code doesn't use this technique and won't set the setjmp anchor. Your "handler" block can't call library functions or do anything involving pointers or you risk needing endlessly-nested blocks. Some things like automatic variables can be in an unpredictable state after a longjmp.
One thing missing here, about mission critical systems (or any
system): In large systems in production, one can't know where, or
even if the segfaults are, so the reccomendation to fix the bug and
not the symptom does not hold.
I don't agree with this thought. Most segmentation faults that I've seen are caused by dereferencing pointers (directly or indirectly) without validating them first. Checking pointers before you use them will tell you where the segfaults are. Split up complex statements like my_array[ptr1->offsets[ptr2->index]] into multiple statements so that you can check the intermediate pointers as well. Static analyzers like Coverity are good about finding code paths where pointers are used without being validated. That won't protect you against segfaults caused by outright memory corruption, but there's no way to recover from that situation in any case.
In short-term practice, I think my errors are only access to
null and nothing more.
Good news! This whole discussion is moot. Pointers and array indices can (and should!) be validated before they are used, and checking ahead of time is far less code than waiting for a problem to happen and trying to recover.

This might not be a complete answer, and it is by no means complete or accurate, but it doesn't fit into a comment
So a SIGSEGV can occur when you try to access memory in a way that you should not (like writing to it when it is read-only or reading from an address range that is not mapped). Such an error alone might be recoverable if you know enough about the environment.
But how do you want to determine why that invalid access happened in the first place.
In one comment to another answer you say:
short-term practice, I think my errors are only access to null and nothing more.
No application is error-free so why do you assume if null pointer access can happen that your application does not e.g. also have a situation where a use after free or an out of bounds access to "valid" memory locations happens, that doesn't immediately result in an error or a SIGSEGV.
A use-after-free or out-of-bounds access could also modify a pointer into pointing to an invalid location or into being a nullptr, but it could also have changed other locations in the memory at the same time. If you now only assume that the pointer was just not initialized and your error handling only considers this, you continue with an application that is in a state that does not match your expectation or one of the compilers had when generating the code.
In that case, the application will - in the best case - crash shortly after the "recovery" in the worst case some variables have faulty values but it will continue to run with those. This oversight could be more harmful for a critical application than restarting it.
If you however know that a certain action might under certain circumstances result in a SIGSEGV you can handle that error, e.g. that you know that the memory address is valid, but that the device the memory is mapped to might not be fully reliable and might cause a SIGSEGV due to that then recovering from a SIGSEGV might be a valid approach.

Depends what you mean by recovery. The only sensible recovery in case the OS sends you the SEGV signal is to clean up your program and spin another one from the start, hopefully not hitting the same pitfall.
You have no way to know how much your memory got corrupted before the OS called an end to the chaos. Chances are if you try to continue from the next instruction or some arbitrary recovery point, your program will misbehave further.
The thing that it seems many of the upvoted responses are forgetting is that there are applications in which segfaults can happen in production without a programming error. And where high availability, decades of lifetime and zero maintenance are expected. In those environments, what's typically done is that the program is restarted if it crashes for any reason, segfault included. Additionally, a watchdog functionality is used to ensure that the program does not get stuck in an unplanned infinite loop.
Think of all the embedded devices you rely on that have no reset button. They rely on imperfect hardware, because no hardware is perfect. The software has to deal with hardware imperfections. In other words, the software must be robust against hardware misbehavior.
Embedded isn't the only area where this is crucial. Think of the amount of servers handling just StackOverflow. The chance of ionizing radiation causing a single event upset is tiny if you look at any one operation at ground level, but this probability becomes non-trivial if you look at a large number of computers running 24/7. ECC memory helps against this, but not everything can be protected.

Your program is an undertermined state because C can't define the state. The bugs which cause these errors are undefined behavior. This is the nastiest class of bad behaviors.
The key issue with recovering from these things is that, being undefined behavior, the complier is not obliged to support them in any way. In particular, it may have done optimizations which, if only defined behaviors occur, provably have the same effect. The compiler is completely within its rights to reorder lines, skip lines, and do all sorts of fancy tricks to make your code run faster. All it has to do is prove that the effect is the same according to the C++ virtual machine model.
When an undefined behavior occurs, all that goes out the window. You may get into difficult situations where the compiler has reordered operations and now can't get you to a state which you could arrive at by executing your program for a period of time. Remember that assignments erase the old value. If an assignment got moved up before the line that segfaulted, you can't recover the old value to "unwind" the optimization.
The behavior of this reordered code was indeed identical to the original, as long as no undefined behavior occurred. Once the undefined behavior occurred, it exposes the fact that the reorder occurred and could change results.
The tradeoff here is speed. Because the compiler isn't walking on eggshells, terrified of some unspecified OS behavior, it can do a better job of optimizing your code.
Now, because undefined behavior is always undefined behavior, no matter how much you wish it wasn't, there cannot be a spec C++ way to handle this case. The C++ language can never introduce a way to resolve this, at least short of making it defined behavior, and paying the costs for that. On a given platform and compiler, you may be able to identify that this undefined behavior is actually defined by your compiler, typically in the form of extensions. Indeed, the answer I linked earlier shows a way to turn a signal into an exception, which does indeed work on at least one platform/compiler pair.
But it always has to be on the fringe like this. The C++ developers value the speed of optimized code over defining this undefined behavior.

As you use the term SIGSEGV I believe you are using a system with an operating system and that the problem occurs in your user land application.
When the application gets the SIGSEGV it is a symptom of something gone wrong before the memory access. Sometimes it can be pinpointed to exactly where things went wrong, generally not. So something went wrong, and a while later this wrong was the cause of a SIGSEGV. If the error happened "in the operating system" my reaction would be to shut down the system. With a very specific exceptions -- when the OS has a specific function to check for memory card or IO card installed (or perhaps removed).
In the user land I would probably divide my application into several processes. One or more processes would do the actual work. Another process would monitor the worker process(es) and could discover when one of them fails. A SIGSEGV in a worker process could then be discovered by the monitor process, which could restart the worker process or do a fail-over or whatever is deemed appropriate in the specific case. This would not recover the actual memory access, but might recover the application function.
You might look into the Erlang philosophy of "fail early" and the OTP library for further inspiration about this way of doing things. It does not handle SIGSEGV though, but several other types of problems.

Your program cannot recover from a segmentation fault because it has no idea what state anything is in.
Consider this analogy.
You have a nice house in Maine with a pretty front garden and a stepping stone path running across it. For whatever reason, you've chosen to connect each stone to the next with a ribbon (a.k.a. you've made them into a singly-linked list).
One morning, coming out of the house, you step onto the first stone, then follow the ribbon to the second, then again to the third but, when you step onto the fourth stone, you suddenly find yourself in Albuquerque.
Now tell us - how do you recover from that?
Your program has the same quandary.
Something went spectacularly wrong but your program has no idea what it was, or what caused it or how to do anything useful about it.
Hence: it crashes and burns.

It is absolutely possible, but this would duplicate existing functionality in a less stable way.
The kernel will already receive a page fault exception when a program accesses an address that is not yet backed by physical memory, and will then assign and potentially initialize a page according to the existing mappings, and then retry the offending instruction.
A hypothetical SEGV handler would do the exact same thing: decide what should be mapped at this address, create the mapping and retry the instruction -- but with the difference that if the handler would incur another SEGV, we could go into an endless loop here, and detection would be difficult since that decision would need to look into the code -- so we'd be creating a halting problem here.
The kernel already allocates memory pages lazily, allows file contents to be mapped and supports shared mappings with copy-on-write semantics, so there isn't much to gain from this mechanism.

So far, answers and comments have responded through the lens of a higher-level programming model, which fundamentally limits the creativity and potential of the programmer for their convenience. Said models define their own semantics and do not handle segmentation faults for their own reasons, whether simplicity, efficiency or anything else. From that perspective, a segfault is an unusual case that is indicative of programmer error, whether the userspace programmer or the programmer of the language's implementation. The question, however, is not about whether or not it's a good idea, nor is it asking for any of your thoughts on the matter.
In reality, what you say is correct: segmentation faults are recoverable. You can, as any regular signal, attach a handler for it with sigaction. And, yes, your program can most certainly be made in such a way that handling segmentation faults is a normal feature.
One obstacle is that a segmentation fault is a fault, not an exception, which is different in regards to where control flow returns to after the fault has been handled. Specifically, a fault handler returns to the same faulting instruction, which will continue to fault indefinitely. This isn't a real problem, though, as it can be skipped manually, you may return to a specified location, you may attempt to patch the faulting instruction into becoming correct or you may map said memory into existence if you trust the faulting code. With proper knowledge of the machine, nothing is stopping you, not even those spec-wielding knights.

When can the problem actually be fixed by catching an exception?

Here's the thing. There's something I don't quite understand about exceptions, and to me they seem like a construct that almost works, but can't be used cleanly.
I have a simple question. When has catching an exception been a useful or necessary component of solving the root cause of the problem? I.e. when have you been able to write code that fixes a problem signaled through an exception? I am looking for factual data, or experience you have had.
Here's what I mean. A normal program does work. If some piece of work can't be completed for reason X, the function responsible for doing the work throws an exception. But who catches the exception? As I see it, there are three reasons you might want to catch an exception:
You catch it because you want to change its type and rethrow it. (This happens when you translate mechanical exception, such as std::out_of_range, to business exceptions, such as could_not_complete_transaction)
You catch it because you want to log it, or let the user know about the problem, before aborting.
You catch it because you actually know how to solve the problem.
It is point 3 that I'm skeptical about. I have never actually caught an exception knowing what to do to solve it. When you get a std::out_of_memory, what are you supposed to do with it? It's not like you can barter the operating system to get more memory. That's just not something you can fix. And it's not just std::out_of_memory, there are also business class exceptions that suffer from this. Think about a potential connection_error exception: what can you do to fix this except wait and retry later and hope it fixes itself?
Now, to be fair, I do know of one case in which code does catch an exception and tries to fix the problem. I know that there are certain Win32 SEH handlers that catch a Stack Overflow exception and try to fix the problem by enlarging the size of the thread stack if it's possible. However, this works because SEH has try-resume semantics, which C++ exceptions don't have (you can't resume at the point the exception occurred).
The main part of the question is over. However, there's also another problem I have with exceptions that, to me, seems exactly the reason why you don't have catch clauses that fix the problem: the code that catches the exception necessarily has to be coupled with the code that throws it. Because, in order to fix the problem, it must have domain specific knowledge about what the problem cause is. But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
PS: Please note that this is not a "exceptions vs. error codes" kind of question; I am well aware that error codes suck as an error handling mechanism. They actually suffer from the same problem I have explained for exceptions.

I think your problem is that you equate "solve the problem" with "make the program keep going correctly". That is the wrong way to think of exceptions, or error handling in general.
Error handling code of any kind should not be something that is internally fixable by the program. That is, error handling logic (like catching exceptions) should not be entered because of programming mistakes.
If the user gives you a non-existent filename, that's not a programming mistake; that's a user-error. You cannot "fix" that without going back to the user and getting an existing file. But exceptions do allow you to undo what you were trying to do, restore the program to a valid state, and then communicate what happened to the user.
An invalid_connection is similarly not a programming mistake. Unlike the above, it's not necessarily a user error either. It's something that's expected to be able to happen, and different programs will handle it in different ways. Some will want to try again. Others will want to halt and let the user know.
The point is, because there is no one means to handle this condition, it cannot be done by the library. The error must be given to the caller of the library to figure out what to do.
If you have a function that parses integers, and you are given text that doesn't conform to an integer, it's not that function's job to figure out what to do next. The caller needs to be notified that the string they provided is malformed and that something ought to be done.
The caller needs to handle the error.
You don't abort most programs because a file that was supposed to contain integers didn't contain integers. But your parsing function does need to communicate this fact to the caller, and the caller does need to deal with that possibility.
That's what "catching exceptions" is for.
Now, unexpected environmental conditions like OOM are a different story. This is not usually external code's fault, but it's also not usually a programming error. And if it is a programming error (ie: memory leak), it's not one you can deal with in most cases. P0709 has an entire section on the ability (or lack thereof) of programs to be able to generally respond to OOM. The result is that, even when programs are coded defensively against OOM exceptions, they're usually still broken when they run out of memory.
Especially when dealing with OS's that don't commit pages to memory until you actually use them.

Here is my take,
There are more reasons to catch exceptions, for example, if it is a critical application, such as ones found in power substations etc. and an exception is caught to which there is no known system recovery or solution, you may want to have a controlled shutdown, protect certain modules, protect connected embedded systems etc. instead of just letting the system crash on its own. The latter could be disastrous...
I.e. when have you been able to write code that fixes a problem signaled through an exception?
When you get a std::out_of_memory, what are you supposed to do with it? It's not like you can barter the operating system to get more memory.
Actually I feel like that was my primary coding style for a while. An example: a system I worked on did not have a huge amount of memory and the system was dedicated, so, it was only my app and nothing else. Whenever I had an out_of_memory type of exception, I'd just kill the older process and open the one with the higher priority. Of course I'd wait for the kill to happen in a controlled fashion.
Think about a potential connection_error exception: what can you do to fix this except wait and retry later and hope it fixes itself?
I'd try to connect through another medium such as bluetooth, fiber, bus etc. Normally of course there would be a primary medium of contact, and the others wouldn't be called unless there is an exception.
But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
Most often an exception in a dedicated library has different consequences in your system than its own. You may not need to read the library and its internal workings to fix the problem. You just need to study its effect on your software and handle that situation. That's probably the easiest solution. And that is a lot easier to do if the library raises a known exception instead of just crashing or giving gibberish answers.

One obvious thing that came to mind was socket connections.
You try and connect to Server A and the program finds that it can't do that
Try connecting to Server B
The other examples regarding user input are equally as valid if not more so.
I admit that seeing something along the lines of
try
{
connectToServerA();
}
catch(cantConnectToServer)
{
connectToServerB();
}
would look like a bit of a weird pattern to see in real world code. It might make sense if the function takes an address and we iterate through a list of potential addresses.
Broadly speaking I agree with you often all you want to do is log the error and terminate - but some systems, which have to be robust and "always on" shouldn't just terminate if they encounter a problem.
Webservers are one obvious example. You don't just terminate because one users connection faulters, because that would drop the session for all the other connected users. There might be parts of code where raising an exception is the simplest way to deal with such a failure however.

Should exceptions ever be caught

No doubt exceptions are usefull as they show programmer where he's using functions incorrectly or something bad happens with an environment but is there a real need to catch them?
Not caught exceptions are terminating the program but you can still see where the problem is. In well designed libraries every "unexpected" situation has actually workaround. For example using map::find instead of map::at, checking whether your int variable is smaller than vector::size prior to using index operator.
Why would anyone need to do it (excluding people using libraries that enforce it)? Basically if you are writing a handler to given exception you could as well write a code that prevents it from happening.

Not all exceptions are fatal. They may be unusual and, therefore, "exceptions," but a point higher in the call stack can be implemented to either retry or move on. In this way, exceptions are used to unwind the stack and a nested series of function or method calls to a point in the program which can actually handle the cause of the exception -- even if only to clean up some resources, log an error, and continue on as before.

You can't always write code that prevents an exception. Just for an obvious example, consider concurrent code. Let's assume I attempt to verify that i is between (say) 0 and 20, then use i to index into some array. So, I check and i == 12, so I proceed to use it to index into the array. Unfortunately, in between the test and the indexing operation, some other thread added 20 to i, so by the time it's used as an index, it's not in range any more.
The concurrency has led to a race condition, so the attempt at assuring against an exceptional condition has failed. While it's possible to prevent this by (for example) wrapping each such test/use sequence in a critical section (or similar), it's often impractical to do so--first, getting the code correct will often be quite difficult, and second even if you do get it correct, the consequences on execution speed may be unacceptable.
Exceptions also decouple code that detects an exceptional condition from code that reacts to that exceptional condition. This is why exception handling is so popular with library writers. The code in the library doesn't have a clue of the correct way to react to a particular exceptional condition. Just for a really trivial example, let's assume it can't read from a file. Should it print a message to stderr, pop up a MessageBox, or write to a log?
In reality, it should do none of these. At least two (and possibly all three) will be wrong for any given program. So, what it should do is throw an exception, and let code at a higher level determine the appropriate way to respond. For one program it may make sense to log the error and continue with other work, but for another the file may be sufficiently critical that its only reasonable reaction is to abort execution entirely.

Exceptions are very expensive, performance vise - thus, whenever performance matter you will want to write an exception free code (using "plain C" techniques for error propagation).
However, if performance is not of immediate concern, then exceptions would allow you to develop a less cluttered code, as error handling can be postponed (but then you will have to deal with non-local transfer of control, which may be confusing in itself).

I have used extensivelly exceptions as a method to transfer control on specific positions depending on event handling.
Exceptions may also be a method to transfer control to a "labeled" position alog the tree of calling functions.
When an exception happens the code may be thought as backtracking one level at a time and checking if that level has an exception active and executing it.
The real problem with exceptions is that you don't really know where these will happen.
The code that arrives to an exception, usually doesn't know why there is a problem, so a fast returning back to a known state is a good action.
Let's make an example: You are in Venice and you look at the map walking throught small roads, at a moment you arrive somewhere that you aren't able to find in the map.
Essentially you are confused and you don't understand where you are.
If you have the ariadne "μιτος" you may go back to a known point and restart to try to arrive where you want.
I think you should treat error handling only as a control structure allowing to go back at any level signaled (by the error handling routine and the error code).

Should programs check for failure on WinAPI functions that "shouldn't", but can, fail?

Recently I was updating some code used to take screenshots using the GetWindowDC -> CreateCompatibleDC -> CreateCompatibleBitmap -> SelectObject -> BitBlt -> GetDIBits series of WinAPI functions. Now I check all those for failure because they can and sometimes do fail. But then I have to perform cleanup by deleting the created bitmap, deleting the created dc, and releasing the window dc. In any example I've seen -- even on MSDN -- the related functions (DeleteObject, DeleteDC< ReleaseDC) aren't checked for failure, presumably because if they were retrieved/created OK, they will always be deleted/released OK. But, they still can fail.
That's just one noteable example since the calls are all right next to each other. But occasionally there are other functions that can fail but in practice never do. Such as GetCursorPos. Or functions that can fail only if passed invalid data, such as FileTimeToSytemTime.
So, is it good-practice to check ALL functions that can fail for failure? Or are some OK not to check? And as a corollary, when checking these should-never-fail functions for failure, what is proper? Throwing a runtime exception, using an assert, something else?

The question whether to test or not depends on what you would do if it failed. Most samples exit once cleanup is finished, so verifying proper clean up serves no purpose, the program is exiting in either case.
Not checking something like GetCursorPos could lead to bugs, but depending on the code required to avoid this determines whether you should check or not. If checking it would add 3 lines around all your calls then you are likely better off to take the risk. However if you have a macro setup to handle it then it wouldn't hurt to add that macro just in case.
FileTimeToSystemTime being checked depends on what you are passing into it. A file time from the system? probably safe to ignore it. A custom string built from user input? probably better to make sure.

Yes. You never know when a promised service will surprise by not working. Best to report an error even for the surprises. Otherwise you will find yourself with a customer saying your application doesn't work, and the reason will be a complete mystery; you won't be able to respond in a timely, useful way to your customer and you both lose.
If you organize your code to always do such checks, it isn't that hard to add the next check to that next API you call.

It's funny that you mention GetCursorPos since that fails on Wow64 processes when the address passed is >2Gb. It fails every time. The bug was fixed in Windows 7.
So, yes, I think it's wise to check for errors even when you don't expect them.

Yes, you need to check, but if you're using C++ you can take advantage of RAII and leave cleanup to the various resources that you are using.
The alternative would be to have a jumble of if-else statements, and that's really ugly and error-prone.

Yes. Suppose you don't check what a function returned and the program just continues after the function failure. What happens next? How will you know why your program misbehaves long time later?
One quite reliable solution is to throw an exception, but this will require your code to be exception-safe.

Yes. If a function can fail, then you should protect against it.
One helpful way to categorise potential problems in code is by the potential causes of failure:
invalid operations in your code
invalid operations in client code (code that call yours, written by
someone else)
external dependencies (file system, network connection etc.)
In situation 1, it is enough to detect the error and not perform recovery, as this is a bug that should be fixable by you.
In situation 2, the error should be notified to client code (e.g. by throwing an exception).
In situation 3, your code should recover as far as possible automatically, and notify any client code if necessary.
In both situations 2 & 3, you should endeavour to make sure that your code recovers to a valid state, e.g. you should try to offer "strong exception guarentee" etc.

The longer I've coded with WinAPIs with C++ and to a lesser extent PInvoke and C#, the more I've gone about it this way:
Design the usage to assume it will fail (eventually) regardless of what the documentation seems to imply
Make sure to know the return value indication for pass/fail, as sometimes 0 means pass, and vice-versa
Check to see if GetLastError is noted, and decide what value that info can give your app
If robustness is a serious enough goal, you may consider it a worthy time-investment to see if you can do a somewhat fault-tolerant design with redundant means to get whatever it is you need. Many times with WinAPIs there's more than one way to get to the specific info or functionality you're looking for, and sometimes that means using other Windows libraries/frameworks that work in-conjunction with the WinAPIs.
For example, getting screen data can be done with straight WinAPIs, but a popular alternative is to use GDI+, which plays well with WinAPIs.

Is it not possible to make a C++ application "Crash Proof"?

Let's say we have an SDK in C++ that accepts some binary data (like a picture) and does something. Is it not possible to make this SDK "crash-proof"? By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
I have no experience with C++, but when I googled, I found several means that sounded like a solution (use a vector instead of an array, configure the compiler so that automatic bounds check is performed, etc.).
When I presented this to the developer, he said it is still not possible.. Not that I don't believe him, but if so, how is language like Java handling this? I thought the JVM performs everytime a bounds check. If so, why can't one do the same thing in C++ manually?
UPDATE
By "Crash proof" I don't mean that the application does not terminate. I mean it should not abruptly terminate without information of what happened (I mean it will dump core etc., but is it not possible to display a message like "Argument x was not valid" etc.?)

You can check the bounds of an array in C++, std::vector::at does this automatically.
This doesn't make your app crash proof, you are still allowed to deliberately shoot yourself in the foot but nothing in C++ forces you to pull the trigger.

No. Even assuming your code is bug free. For one, I have looked at many a crash reports automatically submitted and I can assure you that the quality of the hardware out there is much bellow what most developers expect. Bit flips are all too common on commodity machines and cause random AVs. And, even if you are prepared to handle access violations, there are certain exceptions that the OS has no choice but to terminate the process, for example failure to commit a stack guard page.

By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
This is what usually happens. If you access some invalid memory usually OS aborts your program.
However the question what is invalid memory... You may freely fill with garbage all the memory in heap and stack and this is valid from OS point of view, it would not be valid from your point of view as you created garbage.
Basically - you need to check the input data carefully and relay on this. No OS would do this for you.
If you check your input data carefully you would likely to manage the data ok.

I primarily mean forceful termination
by the OS upon memory access
violation, due to invalid input passed
by the user
Not sure who "the user" is.
You can write programs that won't crash due to invalid end-user input. On some systems, you can be forcefully terminated due to using too much memory (or because some other program is using too much memory). And as Remus says, there is no language which can fully protect you against hardware failures. But those things depend on factors other than the bytes of data provided by the user.
What you can't easily do in C++ is prove that your program won't crash due to invalid input, or go wrong in even worse ways, creating serious security flaws. So sometimes[*] you think that your code is safe against any input, but it turns out not to be. Your developer might mean this.
If your code is a function that takes for example a pointer to the image data, then there's nothing to stop the caller passing you some invalid pointer value:
char *image_data = malloc(1);
free(image_data);
image_processing_function(image_data);
So the function on its own can't be "crash-proof", it requires that the rest of the program doesn't do anything to make it crash. Your developer also might mean this, so perhaps you should ask him to clarify.
Java deals with this specific issue by making it impossible to create an invalid reference - you don't get to manually free memory in Java, so in particular you can't retain a reference to it after doing so. It deals with a lot of other specific issues in other ways, so that the situations which are "undefined behavior" in C++, and might well cause a crash, will do something different in Java (probably throw an exception).
[*] let's face it: in practice, in large software projects, "often".

I think this is a case of C++ codes not being managed codes.
Java, C# codes are managed, that is they are effectively executed by an Interpreter which is able to perform bound checking and detect crash conditions.
With the case of C++, you need to perform bound and other checking yourself. However, you have the luxury of using Exception Handling, which will prevent crash during events beyond your control.
The bottom line is, C++ codes themselves are not crash proof, but a good design and development can make them to be so.

In general, you can't make a C++ API crash-proof, but there are techniques that can be used to make it more robust. Off the top of my head (and by no means exhaustive) for your particular example:
Sanity check input data where possible
Buffer limit checks in the data processing code
Edge and corner case testing
Fuzz testing
Putting problem inputs in the unit test for regression avoidance

If "crash proof" only mean that you want to ensure that you have enough information to investigate crash after it occurred solution can be simple. Most cases when debugging information is lost during crash resulted from corruption and/or loss of stack data due to illegal memory operation by code running in one of threads. If you have few places where you call library or SDK that you don't trust you can simply save the stack trace right before making call into that library at some memory location pointed to by global variable that will be included into partial or full memory dump generated by system when your application crashes. On windows such functionality provided by CrtDbg API.On Linux you can use backtrace API - just search doc on show_stackframe(). If you loose your stack information you can then instruct your debugger to use that location in memory as top of the stack after you loaded your dump file. Well it is not very simple after all, but if you haunted by memory dumps without any clue what happened it may help.
Another trick often used in embedded applications is cycled memory buffer for detailed logging. Logging to the buffer is very cheap since it is never saved, but you can get idea on what happen milliseconds before crash by looking at content of the buffer in your memory dump after the crash.

Actually, using bounds checking makes your application more likely to crash!
This is good design because it means that if your program is working, it's that much more likely to be working /correctly/, rather than working incorrectly.
That said, a given application can't be made "crash proof", strictly speaking, until the Halting Problem has been solved. Good luck!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js