Error handling design problem on collection of items - c++

I have a collection of some items and some operation on them. This operation is a part of remote calls between client and server and it should run on all items at once. On server side it runs repeatedly on each item and may fail or succeed. I need to know which items succeeded and which failed. I guess this is rather common case and there are good solutions to it. How should I design it?

it should run on all items at once
You will hate your life if you don't read into this as a design requirement. All or nothing is the right way to handle it. It will simplify everything you do.
If that isn't an option, just do the dumbest thing possible. Wrap each call in a try/catch and give some report. Chances are no one will be able to consume the report, which is another reason all or nothing is the right thing to do.
edit:
To elaborate: When batching, writing simple logic to report errors is fine, but writing logic to recover from errors is very complicated. I've never seen a system really handle recovery well on batching. I'm sure there are some corner cases where each item is completely independent. At which point makes no matter that one or another failed, but that is usually not the case.
Generally, I expect any errors that happen during a batching operation to not be critical. By that I mean the system should be able to ignore errors and continue operating as if the message that caused the error never existed.
If it's really vital that these messages get processed, then I would definately try for all or nothing.

Related

When can the problem actually be fixed by catching an exception?

Here's the thing. There's something I don't quite understand about exceptions, and to me they seem like a construct that almost works, but can't be used cleanly.
I have a simple question. When has catching an exception been a useful or necessary component of solving the root cause of the problem? I.e. when have you been able to write code that fixes a problem signaled through an exception? I am looking for factual data, or experience you have had.
Here's what I mean. A normal program does work. If some piece of work can't be completed for reason X, the function responsible for doing the work throws an exception. But who catches the exception? As I see it, there are three reasons you might want to catch an exception:
You catch it because you want to change its type and rethrow it. (This happens when you translate mechanical exception, such as std::out_of_range, to business exceptions, such as could_not_complete_transaction)
You catch it because you want to log it, or let the user know about the problem, before aborting.
You catch it because you actually know how to solve the problem.
It is point 3 that I'm skeptical about. I have never actually caught an exception knowing what to do to solve it. When you get a std::out_of_memory, what are you supposed to do with it? It's not like you can barter the operating system to get more memory. That's just not something you can fix. And it's not just std::out_of_memory, there are also business class exceptions that suffer from this. Think about a potential connection_error exception: what can you do to fix this except wait and retry later and hope it fixes itself?
Now, to be fair, I do know of one case in which code does catch an exception and tries to fix the problem. I know that there are certain Win32 SEH handlers that catch a Stack Overflow exception and try to fix the problem by enlarging the size of the thread stack if it's possible. However, this works because SEH has try-resume semantics, which C++ exceptions don't have (you can't resume at the point the exception occurred).
The main part of the question is over. However, there's also another problem I have with exceptions that, to me, seems exactly the reason why you don't have catch clauses that fix the problem: the code that catches the exception necessarily has to be coupled with the code that throws it. Because, in order to fix the problem, it must have domain specific knowledge about what the problem cause is. But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
PS: Please note that this is not a "exceptions vs. error codes" kind of question; I am well aware that error codes suck as an error handling mechanism. They actually suffer from the same problem I have explained for exceptions.
I think your problem is that you equate "solve the problem" with "make the program keep going correctly". That is the wrong way to think of exceptions, or error handling in general.
Error handling code of any kind should not be something that is internally fixable by the program. That is, error handling logic (like catching exceptions) should not be entered because of programming mistakes.
If the user gives you a non-existent filename, that's not a programming mistake; that's a user-error. You cannot "fix" that without going back to the user and getting an existing file. But exceptions do allow you to undo what you were trying to do, restore the program to a valid state, and then communicate what happened to the user.
An invalid_connection is similarly not a programming mistake. Unlike the above, it's not necessarily a user error either. It's something that's expected to be able to happen, and different programs will handle it in different ways. Some will want to try again. Others will want to halt and let the user know.
The point is, because there is no one means to handle this condition, it cannot be done by the library. The error must be given to the caller of the library to figure out what to do.
If you have a function that parses integers, and you are given text that doesn't conform to an integer, it's not that function's job to figure out what to do next. The caller needs to be notified that the string they provided is malformed and that something ought to be done.
The caller needs to handle the error.
You don't abort most programs because a file that was supposed to contain integers didn't contain integers. But your parsing function does need to communicate this fact to the caller, and the caller does need to deal with that possibility.
That's what "catching exceptions" is for.
Now, unexpected environmental conditions like OOM are a different story. This is not usually external code's fault, but it's also not usually a programming error. And if it is a programming error (ie: memory leak), it's not one you can deal with in most cases. P0709 has an entire section on the ability (or lack thereof) of programs to be able to generally respond to OOM. The result is that, even when programs are coded defensively against OOM exceptions, they're usually still broken when they run out of memory.
Especially when dealing with OS's that don't commit pages to memory until you actually use them.
Here is my take,
There are more reasons to catch exceptions, for example, if it is a critical application, such as ones found in power substations etc. and an exception is caught to which there is no known system recovery or solution, you may want to have a controlled shutdown, protect certain modules, protect connected embedded systems etc. instead of just letting the system crash on its own. The latter could be disastrous...
I.e. when have you been able to write code that fixes a problem signaled through an exception?
When you get a std::out_of_memory, what are you supposed to do with it? It's not like you can barter the operating system to get more memory.
Actually I feel like that was my primary coding style for a while. An example: a system I worked on did not have a huge amount of memory and the system was dedicated, so, it was only my app and nothing else. Whenever I had an out_of_memory type of exception, I'd just kill the older process and open the one with the higher priority. Of course I'd wait for the kill to happen in a controlled fashion.
Think about a potential connection_error exception: what can you do to fix this except wait and retry later and hope it fixes itself?
I'd try to connect through another medium such as bluetooth, fiber, bus etc. Normally of course there would be a primary medium of contact, and the others wouldn't be called unless there is an exception.
But when some library documents that "if this function fails, an internal_error exception will be thrown", how am I supposed to be able to fix the problem when I don't know how the library works internally?
Most often an exception in a dedicated library has different consequences in your system than its own. You may not need to read the library and its internal workings to fix the problem. You just need to study its effect on your software and handle that situation. That's probably the easiest solution. And that is a lot easier to do if the library raises a known exception instead of just crashing or giving gibberish answers.
One obvious thing that came to mind was socket connections.
You try and connect to Server A and the program finds that it can't do that
Try connecting to Server B
The other examples regarding user input are equally as valid if not more so.
I admit that seeing something along the lines of
try
{
connectToServerA();
}
catch(cantConnectToServer)
{
connectToServerB();
}
would look like a bit of a weird pattern to see in real world code. It might make sense if the function takes an address and we iterate through a list of potential addresses.
Broadly speaking I agree with you often all you want to do is log the error and terminate - but some systems, which have to be robust and "always on" shouldn't just terminate if they encounter a problem.
Webservers are one obvious example. You don't just terminate because one users connection faulters, because that would drop the session for all the other connected users. There might be parts of code where raising an exception is the simplest way to deal with such a failure however.

Debugging crashes in production environments

First, I should give you a bit of context. The program in question is
a fairly typical server application implemented in C++. Across the
project, as well as in all of the underlying libraries, error
management is based on C++ exceptions.
My question is pertinent to dealing with unrecoverable errors and/or
programmer errors---the loose equivalent of "unchecked" Java
exceptions, for want of a better parallel. I am especially interested
in common practices for dealing with such conditions in production
environments.
For production environments in particular, two conflicting goals stand
out in the presence of the above class of errors: ease of debugging
and availability (in the sense of operational performance). Each of
these suggests in turn a specific strategy:
Install a top-level exception handler to absorb all uncaught
exceptions, thus ensuring continuous availability. Unfortunately,
this makes error inspection more involved, forcing the programmer to
rely on fine-grained logging or other code "instrumentation"
techniques.
Crash as hard as possible; this enables one to perform a post-mortem
analysis of the condition that led to the error via a core
dump. Naturally, one has to provide a means for the system to resume
operation in a timely manner after the crash, and this may be far
from trivial.
So I end-up with two half-baked solutions; I would like a compromise
between service availability and debugging facilities. What am I
missing ?
Note: I have flagged the question as C++ specific, as I am interested
in solutions and idiosyncrasies that apply to it particular;
nonetheless, I am aware there will be considerable overlap with other
languages/environments.
Disclaimer: Much like the OP I code for servers, thus this entire answer is focused on this specific use case. The strategy for embedded software or deployed applications should probably be widely different, no idea.
First of all, there are two important (and rather different) aspects to this question:
Easing investigation (as much as possible)
Ensuring recovery
Let us treat both separately, as dividing is conquering. And let's start by the tougher bit.
Ensuring Recovery
The main issue with C++/Java style of try/catch is that it is extremely easy to corrupt your environment because try and catch can mutate what is outside their own scope. Note: contrast to Rust and Go in which a task should not share mutable data with other tasks and a fail will kill the whole task without hope of recovery.
As a result, there are 3 recovery situations:
unrecoverable: the process memory is corrupted beyond repairs
recoverable, manually: the process can be salvaged in the top-level handler at the cost of reinitializing a substantial part of its memory (caches, ...)
recoverable, automatically: okay, once we reach the top-level handler, the process is ready to be used again
An completely unrecoverable error is best addressed by crashing. Actually, in a number of cases (such as a pointer outside your process memory), the OS will help in making it crash. Unfortunately, in some cases it won't (a dangling pointer may still point within your process memory), that's how memory corruptions happen. Oops. Valgrind, Asan, Purify, etc... are tools designed to help you catch those unfortunate errors as early as possible; the debugger will assist (somewhat) for those which make it past that stage.
An error that can be recovered, but requires manual cleanup, is annoying. You will forget to clean in some rarely hit cases. Thus it should be statically prevented. A simple transformation (moving caches inside the scope of the top-level handler) allows you to transform this into an automatically recoverable situation.
In the latter case, obviously, you can just catch, log, and resume your process, waiting for the next query. Your goal should be for this to be the only situation occurring in Production (cookie points if it does not even occur).
Easing Investigation
Note: I will take the opportunity to promote a project by Mozilla called rr which could really, really, help investigating once it matures. Check the quick note at the end of this section.
Without surprise, in order to investigate you will need data. Preferably, as much as possible, and well ordered/labelled.
There are two (practiced) ways to obtain data:
continuous logging, so that when an exception occurs, you have as much context as possible
exception logging, so that upon an exception, you log as much as possible
Logging continuously implies performance overhead and (when everything goes right) a flood of useless logs. On the other hand, exception logging implies having enough trust in the system ability to perform some actions in case of exceptions (which in case of bad_alloc... oh well).
In general, I would advise a mix of both.
Continuous Logging
Each log should contain:
a timestamp (as precise as possible)
(possibly) the server name, the process ID and thread ID
(possibly) a query/session correlator
the filename, line number and function name of where this log came from
of course, a message, which should contain dynamic information (if you have a static message, you can probably enrich it with dynamic information)
What is worth logging ?
At least I/O. All inputs, at least, and outputs can help spotting the first deviation from expected behavior. I/O include: inbound query and corresponding response, as well as interactions with other servers, databases, various local caches, timestamps (for time-related decisions), ...
The goal of such logging is to be able to reproduce the issue spotted in a control environment (which can be setup thanks to all this information). As a bonus, it can be useful as crude performance monitor since it gives some check-points during the process (note: I am talking about monitoring and not profiling for a reason, this can allow you to raise alerts and spot where, roughly, time is spent, but you will need more advanced analysis to understand why).
Exception Logging
The other option is to enrich exception. As an example of a crude exception: std::out_of_range yields the follow reason (from what): vector::_M_range_check when thrown from libstdc++'s vector.
This is pretty much useless if, like me, vector is your container of choice and therefore there are about 3,640 locations in your code where this could have been thrown.
The basics, to get a useful exception, are:
a precise message: "access to index 32 in vector of size 4" is slightly more helpful, no ?
a call stack: it requires platform specific code to retrieve it, though, but can be automatically inserted in your base exception constructor, so go for it!
Note: once you have a call-stack in your exceptions, you will quickly find yourself addicted and wrapping lesser-abled 3rd party software into an adapter layer if only to translate their exceptions into yours; we all did it ;)
On top of those basics, there is a very interesting feature of RAII: attaching notes to the current exception during unwinding. A simple handler retaining a reference to a variable and checking whether an exception is unwinding in its destructor costs only a single if check in general, and does all the important logging when unwinding (but then, exception propagation is costly already, so...).
Finally, you can also enrich and rethrow in catch clauses, but this quickly litters the code with try/catch blocks so I advise using RAII instead.
Note: there is a reason that std exceptions do NOT allocate memory, it allows throwing exceptions without the throw being itself preempted by a std::bad_alloc; I advise to consciously pick having richer exceptions in general with the potential of a std::bad_alloc thrown when attempting to create an exception (which I have yet to see happening). You have to make your own choice.
And Delayed Logging ?
The idea behind delayed logging is that instead of calling your log handler, as usual, you will instead defer logging all finer-grained traces and only get to them in case of issue (aka, exception).
The idea, therefore, is to split logging:
important information is logged immediately
finer-grained information is written to a scratch-pad, which can be called to log them in case of exception
Of course, there are questions:
the scratch pad is (mostly) lost in case of crash; you should be able to access it via your debugger if you get a memory dump though it's not as pleasant.
the scratch pad requires a policy: when to discard it ? (end of the session ? end of the transaction ? ...), how much memory ? (as much as it wants ? bounded ? ...)
what of the performance cost: even if not writing the logs to disk/network, it still cost to format them!
I have actually never used such a scratch pad, for now all non-crasher bugs that I ever had were solved solely using I/O logging and rich exceptions. Still, should I implement it I would recommend making it:
transaction local: since I/O is logged, we should not need more insight that this
memory bounded: evicting older traces as we progress
log-level driven: just as regular logging, I would want to be able to only enable some logs to get into the scratch pad
And Conditional / Probabilistic Logging ?
Writing one trace every N is not really interesting; it's actually more confusing than anything. On the other hand, logging in-depth one transaction every N can help!
The idea here is to reduce the amount of logs written, in general, whilst still getting a chance to observe bugs traces in detail in the wild. The reduction is generally driven by the logging infrastructure constraints (there is a cost to transferring and writing all those bytes) or by the performance of the software (formatting the logs slows software down).
The idea of probabilistic logging is to "flip a coin" at the start of each session/transaction to decide whether it'll be a fast one or a slow one :)
A similar idea (conditional logging) is to read a special debug field in a transaction field that initiates a full logging (at the cost of speed).
A quick note on rr
With an overhead of only 20%, and this overhead applying only on the CPU processing, it might actually be worth using rr systematically. If this is not feasible, however, it could be feasible to have 1 out of N servers being launched under rr and used to catch hard to find bugs.
This is similar to A/B testing, but for debugging purposes, and can be driven either by a willing commitment of the client (flag in the transaction) or with a probabilistic approach.
Oh, and in the general case, when you are not hunting down anything, it can be easily deactivated altogether. No sense in paying those 20% then.
That's all folks
I could apologize for the lengthy read, but the truth I probably just skimmed the topic. Error Recovery is hard. I would appreciate comments and remarks, to help improve this answer.
If the error is unrecoverable, by definition there is nothing the application can do in production environment, to recover from the error. In other words, the top-level exception handler is not really a solution. Even if the application displays a friendly message like "access violation", "possible memory corruption", etc, that doesn't actually increase availability.
When the application crashes in a production environment, you should get as much information as possible for post-mortem analysis (your second solution).
That said, if you get unrecoverable errors in a production environment, the main problems are your product QA process (it's lacking), and (much before that), writing unsafe/untested code.
When you finish investigating such a crash, you should not only fix the code, but fix your development process so that such crashes are no longer possible (i.e. if the corruption is an uninitialized pointer write, go over your code base and initialize all pointers and so on).

Should exceptions ever be caught

No doubt exceptions are usefull as they show programmer where he's using functions incorrectly or something bad happens with an environment but is there a real need to catch them?
Not caught exceptions are terminating the program but you can still see where the problem is. In well designed libraries every "unexpected" situation has actually workaround. For example using map::find instead of map::at, checking whether your int variable is smaller than vector::size prior to using index operator.
Why would anyone need to do it (excluding people using libraries that enforce it)? Basically if you are writing a handler to given exception you could as well write a code that prevents it from happening.
Not all exceptions are fatal. They may be unusual and, therefore, "exceptions," but a point higher in the call stack can be implemented to either retry or move on. In this way, exceptions are used to unwind the stack and a nested series of function or method calls to a point in the program which can actually handle the cause of the exception -- even if only to clean up some resources, log an error, and continue on as before.
You can't always write code that prevents an exception. Just for an obvious example, consider concurrent code. Let's assume I attempt to verify that i is between (say) 0 and 20, then use i to index into some array. So, I check and i == 12, so I proceed to use it to index into the array. Unfortunately, in between the test and the indexing operation, some other thread added 20 to i, so by the time it's used as an index, it's not in range any more.
The concurrency has led to a race condition, so the attempt at assuring against an exceptional condition has failed. While it's possible to prevent this by (for example) wrapping each such test/use sequence in a critical section (or similar), it's often impractical to do so--first, getting the code correct will often be quite difficult, and second even if you do get it correct, the consequences on execution speed may be unacceptable.
Exceptions also decouple code that detects an exceptional condition from code that reacts to that exceptional condition. This is why exception handling is so popular with library writers. The code in the library doesn't have a clue of the correct way to react to a particular exceptional condition. Just for a really trivial example, let's assume it can't read from a file. Should it print a message to stderr, pop up a MessageBox, or write to a log?
In reality, it should do none of these. At least two (and possibly all three) will be wrong for any given program. So, what it should do is throw an exception, and let code at a higher level determine the appropriate way to respond. For one program it may make sense to log the error and continue with other work, but for another the file may be sufficiently critical that its only reasonable reaction is to abort execution entirely.
Exceptions are very expensive, performance vise - thus, whenever performance matter you will want to write an exception free code (using "plain C" techniques for error propagation).
However, if performance is not of immediate concern, then exceptions would allow you to develop a less cluttered code, as error handling can be postponed (but then you will have to deal with non-local transfer of control, which may be confusing in itself).
I have used extensivelly exceptions as a method to transfer control on specific positions depending on event handling.
Exceptions may also be a method to transfer control to a "labeled" position alog the tree of calling functions.
When an exception happens the code may be thought as backtracking one level at a time and checking if that level has an exception active and executing it.
The real problem with exceptions is that you don't really know where these will happen.
The code that arrives to an exception, usually doesn't know why there is a problem, so a fast returning back to a known state is a good action.
Let's make an example: You are in Venice and you look at the map walking throught small roads, at a moment you arrive somewhere that you aren't able to find in the map.
Essentially you are confused and you don't understand where you are.
If you have the ariadne "μιτος" you may go back to a known point and restart to try to arrive where you want.
I think you should treat error handling only as a control structure allowing to go back at any level signaled (by the error handling routine and the error code).

Should programs check for failure on WinAPI functions that "shouldn't", but can, fail?

Recently I was updating some code used to take screenshots using the GetWindowDC -> CreateCompatibleDC -> CreateCompatibleBitmap -> SelectObject -> BitBlt -> GetDIBits series of WinAPI functions. Now I check all those for failure because they can and sometimes do fail. But then I have to perform cleanup by deleting the created bitmap, deleting the created dc, and releasing the window dc. In any example I've seen -- even on MSDN -- the related functions (DeleteObject, DeleteDC< ReleaseDC) aren't checked for failure, presumably because if they were retrieved/created OK, they will always be deleted/released OK. But, they still can fail.
That's just one noteable example since the calls are all right next to each other. But occasionally there are other functions that can fail but in practice never do. Such as GetCursorPos. Or functions that can fail only if passed invalid data, such as FileTimeToSytemTime.
So, is it good-practice to check ALL functions that can fail for failure? Or are some OK not to check? And as a corollary, when checking these should-never-fail functions for failure, what is proper? Throwing a runtime exception, using an assert, something else?
The question whether to test or not depends on what you would do if it failed. Most samples exit once cleanup is finished, so verifying proper clean up serves no purpose, the program is exiting in either case.
Not checking something like GetCursorPos could lead to bugs, but depending on the code required to avoid this determines whether you should check or not. If checking it would add 3 lines around all your calls then you are likely better off to take the risk. However if you have a macro setup to handle it then it wouldn't hurt to add that macro just in case.
FileTimeToSystemTime being checked depends on what you are passing into it. A file time from the system? probably safe to ignore it. A custom string built from user input? probably better to make sure.
Yes. You never know when a promised service will surprise by not working. Best to report an error even for the surprises. Otherwise you will find yourself with a customer saying your application doesn't work, and the reason will be a complete mystery; you won't be able to respond in a timely, useful way to your customer and you both lose.
If you organize your code to always do such checks, it isn't that hard to add the next check to that next API you call.
It's funny that you mention GetCursorPos since that fails on Wow64 processes when the address passed is >2Gb. It fails every time. The bug was fixed in Windows 7.
So, yes, I think it's wise to check for errors even when you don't expect them.
Yes, you need to check, but if you're using C++ you can take advantage of RAII and leave cleanup to the various resources that you are using.
The alternative would be to have a jumble of if-else statements, and that's really ugly and error-prone.
Yes. Suppose you don't check what a function returned and the program just continues after the function failure. What happens next? How will you know why your program misbehaves long time later?
One quite reliable solution is to throw an exception, but this will require your code to be exception-safe.
Yes. If a function can fail, then you should protect against it.
One helpful way to categorise potential problems in code is by the potential causes of failure:
invalid operations in your code
invalid operations in client code (code that call yours, written by
someone else)
external dependencies (file system, network connection etc.)
In situation 1, it is enough to detect the error and not perform recovery, as this is a bug that should be fixable by you.
In situation 2, the error should be notified to client code (e.g. by throwing an exception).
In situation 3, your code should recover as far as possible automatically, and notify any client code if necessary.
In both situations 2 & 3, you should endeavour to make sure that your code recovers to a valid state, e.g. you should try to offer "strong exception guarentee" etc.
The longer I've coded with WinAPIs with C++ and to a lesser extent PInvoke and C#, the more I've gone about it this way:
Design the usage to assume it will fail (eventually) regardless of what the documentation seems to imply
Make sure to know the return value indication for pass/fail, as sometimes 0 means pass, and vice-versa
Check to see if GetLastError is noted, and decide what value that info can give your app
If robustness is a serious enough goal, you may consider it a worthy time-investment to see if you can do a somewhat fault-tolerant design with redundant means to get whatever it is you need. Many times with WinAPIs there's more than one way to get to the specific info or functionality you're looking for, and sometimes that means using other Windows libraries/frameworks that work in-conjunction with the WinAPIs.
For example, getting screen data can be done with straight WinAPIs, but a popular alternative is to use GDI+, which plays well with WinAPIs.

What causes a nul character written to file just before a pc crash?

We have an application running on several thousand identical machines. Same OS, same hardware, same application installation. On very rare occasions, the machine locks up. Alt tab, ctrl-alt-del, application are all unresponsive. After inspecting our applications log file, a series of null characters are written to the end, as the last data before the crash.
I'm hoping to use this fact as a means to debug the lockup. My guess is that the number of null characters written is equivalent to the space I need to allocate for my log statement, but the content is never actually written to disk. I'm also guessing a disk IO problem occurred, prevent the write, and of course, the OS lockup. I can't confirm of this. So I guess my question is - have you ever seen a condition like this, how did it occur, and how might you go about troubleshooting it?
NTFS does not journal data (only metadata), so things like that can happen. The reason why is just that at the time of the crash/hang, the metadata (file size, data block allocation) was committed, but not the data (data block contents). Unfortunately this is normal behavior with NTFS and will not give you any insight into the problem causing the hang.
So the answer is: a crash at the "right" time can cause this.
BTW: The same thing can of course happen with FAT/FAT32.
I've seen this type of thing happen, I think you're looking in the right general direction.
When this happens I assume you're able to pinpoint the exact hardware? after failure I'd recommend running a memtest (http://www.memtest.org/).
I've seen this sort of thing with power supplies, bad disk controllers, etc. You can go insane trying to track them down.
Seems like you're going about this the right way - see if you can find a way to force the problem to happen more quickly, when it happens run the memtest, run chkdsk /R (check the eventlog for controller errors during this)
any chance you could get a kernel debugger attached?
any chance %SystemRoot%\memory.dmp was produced?