How do C++ developers capture programmatic errors in release builds? - c++

I have a C++ application that crashes with a segfault on some unknown customer data. The customer refuses to share the input data. Is it possible to figure out where the error happened?
When a Java application crashes on the end-user side, it usually produces a stack trace that helps the developer figure out where the error is and which program invariants were broken.
But what should a C++ developer do in this case? Should I recompile the application with some compiler option so that it provides diagnostics when an error happens?

If you don't have the input data required to recreate the problem (for whatever reason...including difficult customers) and you don't have core/minidumps, there is not much you can do. I've been in many situations such as this. My recourse was to recreate what I thought was the execution path based on interviewing the customer and then just do a meticulous code review to find possibilities of error conditions. I would test every candidate condition and eventually find the problem. This is painful, time consuming, and the main prerequisite is that you are able to read code nearly like you're reading your native language.
Begin Story Time
I worked somewhere that had a crash bug randomly manifest in a multi-tenant system. No amount of logging, core dumps, etc. would help us find it. Finally I reviewed the code (line. by. line. for multiple thousands of lines) and noticed that the developer was constructing a std::string instance from a char* sequence passed to the ctor. It was DEEP down in the parts of the code that hardly ever changed, so correlating the issue to recent changes was just a set of false leads. I asked the developer, "Are all your char arrays null terminated?" Answer: "No." Me: "Well we are then randomly reading memory until it finds a null, and apparently sometimes the heap has a lot of contiguous non-zero memory." Handling the char array bounds differently resulted in fixing the problem.
End Story Time
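For what it's worth, the kind of fix described in the story can be sketched like this. It assumes the buffer comes with a known capacity; the helper name field_to_string is made up for illustration and is not from the original code base.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds a std::string from a buffer that may or may
// not contain a terminating '\0' within its first `capacity` bytes.
std::string field_to_string(const char* data, std::size_t capacity) {
    // memchr never reads past `capacity`, unlike std::string(data),
    // which keeps scanning until it happens to find a zero byte.
    const void* nul = std::memchr(data, '\0', capacity);
    const std::size_t len =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - data)
            : capacity;
    return std::string(data, len);
}
```

The two-argument std::string constructor copies exactly len bytes, so an unterminated array is handled the same way as a terminated one.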
While you can't find a single way to find all bugs, there is a defensive design you can apply that is quite simple. Most people put it in the code once they get burned by this type of situation. The approach is to add support for different levels of logging verbosity and essentially instrument your code with log outputs that don't execute unless the code is set to use the correct level of verbosity. Turning the verbosity level up until the bug is recreated gives you at least some idea of where it is happening. Often customers will not have a problem sharing redacted log data (assuming there is sensitive data in the logs). Load the logs in Splunk or something similar (if the customer doesn't already aggregate their logs in an analysis tool) and you'll have an easier time reviewing the data.
Unfortunately with C++ you don't get nice stack traces and post-mortem data for free (in general). You have to add these post-mortem troubleshooting capabilities into your design up front. Most of the design gets driven from the expected deployment environment and user personas of your code, so add "difficult customer" as a persona and start coding. :)

Related

C++ legacy application trying to find which part is accumulating memory

I have a large legacy app written in C++, and some part (or parts) of it is accumulating memory. When I trigger a particular event in the app, I can't step through that part in the debugger: there are many asynchronous in-process calls, and it is very hard to follow the flow of the app.
I need to find the containers that are keeping the data and not being freed every time I trigger this event.
What are the recommended tools or methods to help me find the leak?
I tried Visual Leak Detector and C++ Memory Validator, but it is very hard to find where the problem is.
To paraphrase, all well-engineered programs are alike; each ancient thrice-ported piece of legacy code is awful in its own way.
Your major tools in the tool chest, which are broadly independent of your development environment are:
Simplify: strip out or replace with NOPs or mock up with trivial implementations as many parts of the code as you can, while retaining the bad behaviour. This will help to remove confounding details.
Instrument: check your heap state before and after every call; write debugging info from the constructors and destructors of objects, spitting out their locations, to see where the leaked resources are being allocated; etc.
Contracts: implement pre- and post-condition checks on methods, that check for correct resource usage state.
Binary search: use a function which checks an invariant that reflects correct resource usage, and use binary subdivision within the problematic section of code to find where it is being violated.
What works in your case is of course hugely dependent upon the details of the code.
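As one possible form of the "Instrument" idea, here is a sketch of a mixin that counts live instances of a class. It assumes you are able to add a base class to the suspect type; InstanceCounter and CachedEvent are made-up names, and the counter is not thread-safe as written.

```cpp
#include <cstddef>

// Mixin that counts live instances of T. Deriving a suspect class from
// InstanceCounter<T> is an assumption about how far you can modify the code.
template <typename T>
class InstanceCounter {
public:
    static std::size_t live() { return count_; }

protected:
    InstanceCounter() { ++count_; }
    InstanceCounter(const InstanceCounter&) { ++count_; }
    InstanceCounter& operator=(const InstanceCounter&) = default;
    ~InstanceCounter() { --count_; }

private:
    static std::size_t count_;
};

template <typename T>
std::size_t InstanceCounter<T>::count_ = 0;

// Hypothetical class suspected of accumulating.
struct CachedEvent : InstanceCounter<CachedEvent> {
    int payload = 0;
};
```

Printing InstanceCounter<CachedEvent>::live() before and after triggering the event shows whether that type is the one piling up, which narrows down which containers to look at.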

Embedded Lua - timing out rogue scripts (e.g. infinite loop) - an example anyone?

I have embedded Lua in a C++ application. I need to be able to stop rogue (i.e. badly written) scripts from hogging resources.
I know I will not be able to cater for EVERY type of condition that causes a script to run indefinitely, so for now, I am only looking at the straightforward Lua side (i.e. scripting side problems).
I also know that this question has been asked (in various guises) here on SO. Probably the reason why it is constantly being re-asked is that as yet, no one has provided a few lines of code to show how the timeout (for the simple cases like the one I described above), may actually be implemented in working code - rather than talking in generalities, about how it may be implemented.
If anyone has actually implemented this type of functionality in a C++ with embedded Lua application, I (as well as many other people - I'm sure), will be very grateful for a little snippet that shows:
How a timeout can be set (in the C++ side) before running a Lua script
How to raise the timeout event/error (C++ /Lua?)
How to handle the error event/exception (C++ side)
Such a snippet (even pseudocode) would be VERY, VERY useful indeed
You need to address this with a combination of techniques. First, you need to establish a suitable sandbox for the untrusted scripts, with an environment that provides only those global variables and functions that are safe and needed. Second, you need to provide for limitations on memory and CPU usage. Third, you need to explicitly refuse to load pre-compiled bytecode from untrusted sources.
The first point is straightforward to address. There is a fair amount of discussion of sandboxing Lua available at the Lua users wiki, on the mailing list, and here at SO. You are almost certainly already doing this part if you are aware that some scripts are more trusted than others.
The second point is the question you are asking. I'll come back to that in a moment.
The third point has been discussed at the mailing list, but may not have been made very clearly in other media. It has turned out that there are a number of vulnerabilities in the Lua core that are difficult or impossible to address, but which depend on "incorrect" bytecode to exercise. That is, they cannot be exercised from Lua source code, only from pre-compiled and carefully patched byte code. It is straightforward to write a loader that refuses to load any binary bytecode at all.
With those points out of the way, that leaves the question of a denial of service attack either through CPU consumption, memory consumption, or both. First, the bad news. There are no perfect techniques to prevent this. That said, one of the most reliable approaches is to push the Lua interpreter into a separate process and use your platform's security and quota features to limit the capabilities of that process. In the worst case, the run-away process can be killed, with no harm done to the main application. That technique is used by recent versions of Firefox to contain the side-effects of bugs in plugins, so it isn't necessarily as crazy an idea as it sounds.
One interesting complete example is the Lua Live Demo. This is a web page where you can enter Lua sample code, execute it on the server, and see the results. Since the scripts can be entered anonymously from anywhere, they are clearly untrusted. This web application appears to be as secure as can be asked for. Its source kit is available for download from one of the authors of Lua.
Snippet is not a proper use of terminology for what an implementation of this functionality would entail, and that is why you have not seen one. You could use debug hooks to provide callbacks during execution of Lua code. However, interrupting that process after a timeout is non-trivial and dependent upon your specific architecture.
You could consider using a longjmp to a jump buffer set just prior to the lua_call or lua_pcall after catching a time out in a luaHook. Then close that Lua context and handle the exception. The timeout could be implemented numerous ways and you likely already have something in mind that is used elsewhere in your project.
The best way to accomplish this task is to run the interpreter in a separate process. Then use the provided operating system facilities to control the child process. Please refer to RBerteig's excellent answer for more information on that approach.
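A rough sketch of the separate-process approach, assuming a POSIX system (fork, waitpid, kill). The function run_with_timeout and the RunResult enum are hypothetical names. To keep the sketch free of a Lua dependency, the work is an arbitrary callable; in a real embedding it would create the lua_State and run the script inside the child.

```cpp
#include <chrono>
#include <functional>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class RunResult { Finished, TimedOut, Failed };

// Runs `work` in a forked child and kills the child if it exceeds `timeout`.
RunResult run_with_timeout(const std::function<void()>& work,
                           std::chrono::milliseconds timeout) {
    const pid_t pid = fork();
    if (pid < 0) return RunResult::Failed;
    if (pid == 0) {
        work();
        _exit(0);  // leave the child without running the parent's atexit handlers
    }
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        const pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) {
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                       ? RunResult::Finished
                       : RunResult::Failed;
        }
        if (done < 0) return RunResult::Failed;
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);        // the runaway child cannot ignore SIGKILL
            waitpid(pid, &status, 0);  // reap it so no zombie is left behind
            return RunResult::TimedOut;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
```

The child cannot hand results back through memory, so a real implementation would also need a pipe or similar channel for the script's output. Polling every 5 ms is a simplification.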
A very naive and simple, but all-lua, method of doing it, is
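-- Limit may be in the millions range depending on your needs
local limit = 1e6
setfenv(code, sandbox)  -- Lua 5.1 only
local ok, err = pcall(function()
  -- Raise an error once `limit` VM instructions have been executed
  debug.sethook(function() error("Timeout!") end, "", limit)
  code()
end)
debug.sethook()  -- always clear the hook, even when the script timed out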
I expect you can achieve the same through the C API.
However, there's a good number of problems with this method. Set the limit too low, and it can't do its job. Too high, and it's not really effective. (Can the chunk get run repeatedly?) Allow the code to call a function that blocks for a significant amount of time, and the above is meaningless. Allow it to do any kind of pcall, and it can trap the error on its own. And whatever other problems I haven't thought of yet. Here I'm also plain ignoring the warnings against using the debug library for anything (besides debugging).
Thus, if you want it reliable, you should probably go with RB's solution.
I expect it will work quite well against accidental infinite loops, the kind that beginning lua programmers are so fond of :P
For memory overuse, you could do the same with a function checking for increases in collectgarbage("count") at far smaller intervals; you'd have to merge them to get both.

How to approach debugging a huge not so familiar code base?

Every so often when working on large-scale projects, you are suddenly moved to a project that is already in its maintenance phase. You end up with a huge C/C++ code base on your hands and not much documentation about the design. The last person who could have given you some knowledge transfer has already left the company, and, to add to your horrors, there is not enough time to get acquainted with the code and develop an understanding of the overall module(s). In this scenario, when you are expected to fix bugs (core dumps, functionality, performance problems, etc.) in the module(s), what approach will you take?
So the question is:
What are your usual steps for debugging a not so familiar C/C++ code base when trying to fix a bug?
EDIT: The environment is Linux, but the code is ported to Windows too, so suggestions for both will be helpful.
If possible, step through it from main() to the problematic area, and follow the execution path. Along the way you'll get a good idea of how the different parts play together.
It could also be helpful to use a static code analysis tool, like CppDepend or even Doxygen, to figure out the relations between modules and be able to view them graphically.
Use a pen and paper, or images/graphs/charts in general, to figure out which parts belong where and draw some arrows and so on.
This helps you build and see the image that will then be refined in your mind as you become more comfortable with it.
I used a similar approach attacking a hellish system that had 10 singletons all #including each other. I had to redraw it a few times in order to fit everything, but seeing it in front of you helps.
It might also be useful to use Graphviz when constructing dependency graphs. That way you only have to list everything (in a text file) and then the tool will draw the (often unsightly) picture. (This is what I did for the #include dependencies in the above system.)
As others have already suggested, writing unit-tests is a great way to get into the codebase. There are a number of advantages to this approach:
It allows you to test your assumptions about how the code works. Adding a passing test proves that your assumptions about that small piece of code that you are testing are correct. The more passing tests you write, the better you understand the code.
A failing unit test that reproduces the bug you want to fix will pass when you fix the bug and you know that you have succeeded.
The unit tests that you write act as documentation for the future.
The unit tests you write act as regression tests as more bugs are fixed.
Of course adding unit tests to legacy code is not always an easy task. Happily, a gentleman by the name of Michael Feathers has written an excellent book on the subject, which includes some great 'recipes' on adding tests to code bases without unit tests.
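To make the first point concrete, here is a small sketch of a test that pins down existing behaviour. It uses plain assert instead of a test framework, and trim_left is a stand-in for whatever legacy function you are trying to understand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical legacy function whose behaviour we want to pin down.
std::string trim_left(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && s[i] == ' ') ++i;
    return s.substr(i);
}

// Each assertion records one assumption about how the code behaves today.
void test_trim_left() {
    assert(trim_left("  abc") == "abc");
    assert(trim_left("abc  ") == "abc  ");  // trailing spaces are kept
    assert(trim_left("   ") == "");
    assert(trim_left("\tabc") == "\tabc");  // tabs are not treated as spaces
}
```

If one of these assertions fails, you have learned something about the code before changing a single line of it.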
Some pointers:
Debug from the part which seems most relevant to the workflow.
Use debug strings.
Get the appropriate .pdb and load the core dump in a debugger like WinDbg or DebugDiag to analyze it.
Get help from someone in your organization who is good at debugging. Even if they are new to your codebase, they can be very helpful; in my experience they give you valuable pointers.
Per Assaf Lavie's advice, you could use static code analyzers.
The most important thing: as you explore and debug, document everything as you progress. At least the person succeeding you will suffer less.
Three things i don't see yet:
write some unit tests which use the libraries/interfaces. demonstrate/verify your understanding of them and promote their maintainability.
sometimes it is nice to create a special assertion macro to check that the other engineer's assumptions are in line with yours. you could:
not commit their uses
commit their uses, converting them to 'real' assertions after a given period
commit their uses, allowing another engineer (more familiar with the project) to dispose or promote them to real assertions
refactoring can also help. code that is difficult to read is an indication.
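One way such a special assertion macro could look, as a sketch only. The names ASSUME, ASSUME_FATAL and ASSUME_CHECKS_OFF are invented here. The idea is that the check logs by default and only aborts once someone decides to promote it to a real assertion.

```cpp
#include <cstdio>
#include <cstdlib>

// Counts failed assumptions so they can be inspected after a run.
inline int& assumption_failures() {
    static int n = 0;
    return n;
}

inline void assumption_failed(const char* expr, const char* note,
                              const char* file, int line) {
    ++assumption_failures();
    std::fprintf(stderr, "%s:%d: assumption failed: %s (%s)\n",
                 file, line, expr, note);
#ifdef ASSUME_FATAL
    std::abort();  // "promoted" mode: behave like a real assertion
#endif
}

// Hypothetical macro: unlike assert(), it does not depend on NDEBUG.
#ifndef ASSUME_CHECKS_OFF
#define ASSUME(cond, note)                                        \
    do {                                                          \
        if (!(cond)) {                                            \
            assumption_failed(#cond, (note), __FILE__, __LINE__); \
        }                                                         \
    } while (0)
#else
#define ASSUME(cond, note) do { (void)sizeof(cond); } while (0)
#endif
```

A call site would look like ASSUME(ptr != nullptr, "caller owns the buffer"); the note is where you write down what you believe the other engineer intended.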
The first step should be to try to read the code. Try to find the code where the bug is. Follow the code from main to that point and try to see what could be wrong. Read the comments in the code (if any). Normally the function names are useful. Understand what each function does.
Once you get some idea of the code then you can start debugging the code. Put breakpoints where you don't understand the code or where you think the error can be. Start following the code line by line. Debugging is like sex. Initially painful, but slowly you start to enjoy it.
cscope + ctags are available on both Linux and Windows (via Cygwin). If you give them a chance, these tools will become indispensable to you. Although, IDEs like Visual Studio also do an excellent job with code browsing facilities as well.
In a situation like yours, because of time constraints, you are driven by symptoms. I mean that you don't have time to reconstruct the big picture / design / architecture. So you focus on the symptoms and work outwards, and each time reconstruct as much of the big picture as you need for that particular problem. But do not make "local" decisions in a hurry. Have the patience to see as much of the big picture as needed to make a good quality decision. And don't get caught in the band-aid syndrome i.e. put any old fix in that will work. It is your job to preserve the underlying architecture / design (if there is one, and to whatever extent that you can discover it).
It will be a struggle at first, as your mind "hunts" excessively. But soon the main themes in the design / architecture will emerge, and all of it will start to make sense. Think, by not thinking, grasshoppa :)
You need a fully reliable IDE with plenty of debugging tools (breakpoints, watches, and the like). The best way to familiarize yourself with a huge code base is to play around with it and see how data is passed from one method to another. Also, you can reverse engineer the code so you can see the relationships between the classes. :D Good Luck!
For me, there is only one way to get to know a process - Interaction. Identify the interfaces of the process/system. Then identify the input/output relationship (these steps may not be linear). Once you do that, you can start tinkering with the code with a fair amount of confidence, because you know what it is "supposed to do"; then it's just a matter of finding out "how it is actually being done". For me though, getting to know the interface (not necessarily the user interface) of the system is the key. To put it bluntly - Never touch the code first!!!
Not sure about C/C++, but coming from Java and C#, unit testing will help. In Java there's JUnit and TestNG libraries for unit testing, in C# there's NUnit and mstest. Not sure about C/C++.
Read the book 'Refactoring: Improving the Design of Existing Code' by Martin Fowler, Kent Beck, et al. Will be quite a few tips in there I'm sure that will help, and give you some guidance to improving the code.
One tip: if it aint broke, don't fix it. Don't bother trying to fix some library or really complicated function if it works. Focus on parts where there's bugs.
Write a unit test to reproduce the scenario where the code should work. The test will fail at first. Fix the code until the unit test passes successfully. Repeat :)
Once a majority of your code, the important bits that are too complex to manually debug and fix, is under automated unit tests, you'll have a safety harness of regression tests that'll make you feel more confident at changing the existing code base.
while (!codeUnderstood)
{
    Breakpoints();
    Run();
    StepInto();
    if (needed)
    {
        StepOver();
    }
}
I don't try to get an overview of the whole system as suggested by many here. If there is something which needs fixing I learn the smallest part of the code I can to fix the bug. The next time there is an issue I'm a little more familiar and a little less daunted and I learn a little more. Eventually I'm able to support the whole shebang.
If management suggests I make a major change to something I'm not familiar with, I make sure they understand the time scales and, if things are really messy, suggest a rewrite.
Usually the program in question will produce some kind of output ( log, console printout, dialog box ).
Find the closest place to your problem in the program output
Search through the code base and look for the text in that output
Start putting your own printouts, nothing fancy, just printf( "Calling xxx\n" );, so you can pinpoint exactly to the point where the problem starts.
Once you pinpointed the problem spot, put a breakpoint
When you hit the breakpoint, print a stacktrace
Now you can see what players you have and start the analysis of how you've got to the wrong place.
Hopefully the names of the methods on the call stack are more meaningful than a, b and c ( seen this ), and there is some sort of comments, method documentation more meaningful than calling a ( seen this many times ).
If the source is poorly documented, don't be afraid to leave your comments once you have figured out what's going on. If program design permits it create a unit test for the problem you've fixed.
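If you want slightly more than a bare printf, a tracing macro along these lines saves typing the location by hand. This is only a sketch; TRACE and format_trace are made-up names.

```cpp
#include <cstdio>
#include <string>

// Builds "file:line (function): message" for a trace line.
std::string format_trace(const char* file, int line, const char* func,
                         const char* msg) {
    return std::string(file) + ":" + std::to_string(line) + " (" + func +
           "): " + msg;
}

// Hypothetical macro: writes to stderr, which is typically unbuffered,
// so the last trace before a crash is less likely to be lost.
#define TRACE(msg)                                                       \
    std::fprintf(stderr, "%s\n",                                         \
                 format_trace(__FILE__, __LINE__, __func__, (msg)).c_str())
```

Sprinkling TRACE("Calling xxx"); along the suspected path gives the same effect as the printf approach above, with the source location included for free.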
Thanks for the nice answers; there are quite a number of points to take up. I have been in this situation a number of times, and here is the usual procedure I follow:
1. Check the crash log or trace log. Check the relevant trace to see whether it is just a simple developer mistake; if you cannot work it out in one go, move on to step 2.
2. Reproduce the bug! This is the most important thing to do. Some bugs occur rarely, and there is nothing like being able to reproduce one: it means you have a much better chance of cracking it.
3. If you can't reproduce the bug, find an alternative use case or situation in which you can actually reproduce it. Being able to debug a live scenario is much more useful than just the crash log.
4. Head to version control! Check whether the same buggy behavior exists in the previous few SW versions. If not, voila! You can find the two versions between which the bug was introduced, get the code difference between them, and target the relevant area. (Sometimes it is not the newly added code that has the bug; it merely exposes some old leftovers. Still, we at least have a start!)
5. Enable the debug traces. Run the use case of the bug and check whether you can find additional information useful for the investigation.
6. Get hold of the relevant code area through the trace log. Look there for the code introducing the bug.
7. Put some breakpoints in the relevant code. Study the flow. Check the data flows. Look out for pointers (the usual culprits). Repeat until you have a grasp of the flow.
8. If you have a SW version which does not reproduce the bug, compare what is different in the flows. Ask yourself: what's the difference?
9. Still no luck? Argh, my tricks are exhausted, so it is time to go the old way: understand the code, and keep at it until you know what is happening in the code when that particular use case is executed.
10. With the newly developed understanding, try debugging the code again; the solution should be around the corner.
11. Most important: document the understanding you have developed about the module(s), even the small nitty-gritty things. It is sure to help you, or someone just like you, someday.
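The version-control step is essentially a binary search over revisions. Here is a sketch of the idea, assuming the revisions are ordered oldest to newest, the newest one is known to be bad, and the bug stays once it appears. The function name first_bad_revision and the is_bad callback are hypothetical; in practice is_bad would mean "check out that revision, build it, and run the failing use case".

```cpp
#include <cstddef>
#include <functional>

// Returns the index of the first revision for which is_bad(index) is true.
// Preconditions (assumed, not checked): count >= 1, is_bad(count - 1) is
// true, and every revision after the first bad one is also bad.
std::size_t first_bad_revision(
    std::size_t count, const std::function<bool(std::size_t)>& is_bad) {
    std::size_t lo = 0;          // earliest revision that could be the first bad one
    std::size_t hi = count - 1;  // known to be bad
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (is_bad(mid)) {
            hi = mid;      // the first bad revision is mid or earlier
        } else {
            lo = mid + 1;  // the first bad revision is after mid
        }
    }
    return lo;
}
```

With 100 candidate revisions this needs at most 7 builds instead of up to 100. Some version control tools automate the same search (git bisect, for example).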
You can try GNU cFlow tool (http://www.gnu.org/software/cflow/).
It will give you graph, charting control flow within program.

Based on your development stack, which is easier for you and why? Debugging or logging?

Please state if you are developing on the front end, back end, or if you are developing a mobile/desktop application.
List your development stack
Language, IDE, etc..
Unit Testing or no Unit Testing
Be sure to include any AOP frameworks if used.
Tell me if it is easier for you to use a debugger or to use logging during development, and why you feel it is easier.
I'm just trying to get a feel for why people choose to use a debugger or logging based on their development stack.
[Front end and Back end. Desktop]
As usual: it depends....
Debugging is better if you are investigating behaviour at a distinct place in the code and/or you don't know what objects you will need to inspect and you don't mind interfering with the natural speed/order of code flow
Logging is better if there is a known variable or variables you need to monitor often over a wide swath of the flow AND when you want the code to run naturally without interruptions. Logging is also a useful addition to unit testing.
It entirely depends on the type of problem. A lot of the work that I do currently is done on the back-end (C#, WCF-services). I typically find it easiest to use logging to get a rough idea on where and when a problem occurs, then I try to tailor a test that provokes the behaviour, and then use debugging in order to fix it.
I mainly use logging and unit testing, though I think my greatest weakness as a programmer is that I am not proficient in using gdb. I can do the basic stuff (breakpoints, watches) but don't know enough to tap into the power it really has.
I feel some discord in the question. Debugging—according to Wikipedia—is:
Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program
Logging is an automatic writing of trace text records while program is running.
So I use logging as a part of debugging, and I think many people do. Otherwise, what were logs made for? Well, maybe for further numeric analysis, but that's another story.

Theory on error handling?

Most advice concerning error handling boils down to a handful of tips and tricks (see this post for example). These hints are helpful but I think they don't answer all questions. I feel that I should design my application according to a certain philosophy, a school of thought that provides a strong foundation to build upon. Is there such a theory on the topic of error handling?
Here's a few practical questions:
How to decide if an error should be handled locally or propagated to higher level code?
How to decide between logging an error, or showing it as an error message to the user?
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
Does it make sense to create a list of error codes? Or is that old fashioned these days?
In many cases common sense is sufficient for developing a good-enough strategy to deal with error conditions. However, I would like to know if there is a more formal/"scholarly" approach?
PS: this is a general question, but C++ specific answers are welcome too (C++ is my main programming language for work).
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
Just wanted to comment on this. My view is to never log directly in library code, but to provide hooks or callbacks so that logging can be implemented in the application code; the application can then decide what to do with the log output (if anything at all).
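A sketch of what such a hook could look like in C++. The library namespace mylib, the functions set_log_sink and parse_port, and the message format are all invented for illustration, and the global sink is not thread-safe as written.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <string>
#include <utility>

namespace mylib {  // hypothetical library

using LogSink = std::function<void(const std::string&)>;

inline LogSink& sink() {
    static LogSink s;  // empty by default: the library stays silent
    return s;
}

// Called by the application to decide where library messages go.
inline void set_log_sink(LogSink s) { sink() = std::move(s); }

// Example library function that reports a problem through the hook.
inline bool parse_port(const std::string& text, int& port) {
    try {
        std::size_t used = 0;
        const int value = std::stoi(text, &used);
        if (used == text.size() && value > 0 && value <= 65535) {
            port = value;
            return true;
        }
    } catch (const std::exception&) {
        // std::stoi throws on non-numeric or out-of-range input
    }
    if (sink()) sink()("parse_port: invalid port '" + text + "'");
    return false;
}

}  // namespace mylib
```

The application calls mylib::set_log_sink once at startup with a lambda that forwards to its own logger. If it never does, the library produces no output at all.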
A couple of years ago I thought exactly about the same question :)
After searching and reading several things, I think that the most interesting reference I found was Patterns for Generation, Handling and Management of Errors from Andy Longshaw and Eoin Woods. It is a short and systematic attempt to cover the basic idioms you mention and some others.
The answer to these questions is quite controversial, but the authors above were brave enough to expose themselves in a conference, and then put their thoughts on paper.
Introduction
To understand what needs to be done for error handling, I think one needs clearly to understand the types of errors one encounters, and the contexts in which one encounters them.
To me, it has been extremely useful to consider the two major types of errors as:
Errors that should never happen, and are typically due to a bug in the code.
Errors which are expected and cannot be prevented in normal operation, such as a database connection going down because of a database issue over which the application has no control.
The way an error should be handled depends heavily on which type of error it is.
The differing contexts which affect how errors should be handled are:
Application code
Library code
The handling of errors in library code differs somewhat from the handling in application code.
A philosophy for handling of the two major types of errors is discussed below. The special considerations for library code are also addressed. Finally, the specific practical questions in the original post are addressed in the context of the philosophy presented.
Types of errors
Programming errors - bugs - and other errors that should never happen
Many errors are the result of programming mistakes. These errors typically cannot be corrected, since the specific programming mistake cannot be anticipated. That means we can't know in advance what condition the mistake leaves the application in, so we can't recover from that condition and shouldn't try.
Ultimately, the fix to this kind of error is to fix the programming mistake. To facilitate that, the error should be surfaced as quickly as possible. Ideally, the program should exit immediately after identifying such an error and providing the relevant information. A quick and obvious exit reduces the time required to complete the debug and retest cycle, permitting more bugs to be fixed in the same amount of testing time; that in turn results in having a more robust application with fewer bugs when it comes time to deploy.
The other major objective in handling this type of error should be to provide sufficient information to make it easy to identify the bug. In Java, for example, throwing a RuntimeException often provides sufficient information in the stack trace to identify the bug immediately; in clean code, immediate fixes can often be identified just from examining the stack trace. In other languages, one might log the call stack or otherwise preserve the necessary information. It is critical not to suppress information in the interests of brevity; don't worry about how much log space you are taking up when this type of error occurs. The more information that is provided, the quicker the bugs can be fixed, and the fewer bugs will remain to pollute the logs when the application makes it to production.
Server applications
Now, in some server applications, it's important that the server be sufficiently fault tolerant to continue operation even in the face of occasional programming errors. In this case, the best approach is to have a very clear separation between the server code that must continue operation and the task processing code that can be allowed to fail. For example, tasks can be relegated to threads or subprocesses, as is done in many web servers.
In such a server architecture, the thread or subprocess handling the task can then be treated like an application which can fail. All the considerations above apply to such a task: the error should be surfaced as quickly as possible by a clean exit from the task, and sufficient information should be logged to permit the bug to be easily found and fixed. When such a task exits in Java, for example, the entire stack trace of any RuntimeException causing the exit should normally be logged.
As much of the code as possible should be executed within the threads or processes handling the task, rather than in the main server thread or process. This is because any bug in the main server thread or process will still cause the entire server to go down. It's better to push the code - with the bugs it contains - into the task handling code where it won't cause a server crash when the bug manifests itself.
Errors that are expected and cannot be prevented in normal operation
Errors that are expected and cannot be prevented in normal operation, such as an exception from a database or other service separate from the application, require very different treatment. In these cases, the objective is not to fix the code, but rather to have the code handle the error when that makes sense, and inform users or operators who can fix the problem otherwise.
In these cases, for example, the application may wish to throw away any results that have accumulated thus far, and retry the operation. In database access, use of transactions can help ensure that accumulated data is discarded. In other cases, it can be useful to write one's code with such retries in mind. The concept of idempotency can also be useful here.
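A minimal retry helper along those lines, as a sketch. TransientError and retry are hypothetical names, and the helper assumes the operation is safe to repeat (idempotent, or wrapped in a transaction that is rolled back on failure).

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical exception type marking failures that are worth retrying.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Calls `operation` up to `max_attempts` times and returns the number of
// attempts used. Rethrows the last TransientError if every attempt fails;
// any other exception propagates immediately without a retry.
int retry(int max_attempts, const std::function<void()>& operation) {
    for (int attempt = 1;; ++attempt) {
        try {
            operation();
            return attempt;
        } catch (const TransientError&) {
            if (attempt >= max_attempts) throw;
        }
    }
}
```

There is no delay between attempts here. A real implementation would likely add a backoff so a struggling database is not hammered with immediate retries.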
When automated retries won't sufficiently solve the problem, human beings should be informed. The user should be informed that the operation failed; often the user can be given the option of retrying. The user can then judge whether a retry is desirable, and can also make alterations in input that might help things go better on a retry.
For this type of error, logging and perhaps email notices can be used to inform system operators. Unlike logging of programming errors, logging of errors that are expected in normal operation should be more succinct, since the error may happen many times and appear many times in the logs; operators will often be analyzing the pattern of many errors, rather than focusing on one individual error.
Libraries and applications
The above discussion of types of errors is directly applicable to application code. The other major context for error handling is library code. Library code still has the same two basic types of errors, but it typically cannot or should not communicate directly with the user, and it has less knowledge than the application code about the application context, including whether an immediate exit is acceptable.
As a result, there are differences in how libraries should handle logging, how they should handle errors that may be expected in normal operation, and how they should handle programming errors and other errors that should never happen.
With respect to logging, the library should if possible support logging in the format desired by the client application code. One valid approach is to do no logging at all, and allow the application code to do all logging based on error information provided to the application code by the library. Another approach is to use a configurable logging interface, allowing the client application to provide the implementation for the logging, for example when the library is first loaded. In Java, for example, the library might log through the SLF4J interface, and leave it to the application to choose and configure the implementation behind it, such as logback.
For bugs and other errors that should never happen, libraries still cannot simply exit the application, since that may not be acceptable to the application. Rather, libraries should exit the library call, providing the caller with sufficient information to help diagnose the problem. The information may be provided in the form of an exception with a stack trace, or the library may log the information if the configurable logging approach is being used. The application can then treat this as it would any other error of this type, typically by exiting, or in a server, by allowing the task process or thread to exit, with the same logging or error reporting that would be done for programming errors in the application code.
Errors that are expected in normal operation should also be reported to the client code. In this case, as with this type of error when encountered in the client code, the information associated with the error can be more succinct. Typically libraries should do less local handling of this type of error, relying more on the client code to decide things like whether to retry and how many times. The client code can then pass along the retry decision to the user if desired.
Practical questions
Now that we have the philosophy, let's apply it to the practical questions you mention.
How to decide if an error should be handled locally or propagated to higher level code?
If it is an error that is expected in normal operation, retry or possibly consult the user locally. Otherwise, propagate it to higher level code.
How to decide between logging an error, or showing it as an error message to the user?
If it is an error that is expected in normal operation, and user input would be useful to determine what action to take, get user input and log a succinct message; if it seems to be a programming error, provide the user with a brief notification and log more extensive information.
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
Logging from the library code should be under the control of the client code. At most, the library should log to an interface for which the client provides the implementation.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Exceptions that are expected in normal operation can be caught locally and the operation retried or otherwise handled. In all other cases, exceptions should be allowed to propagate.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
The types of errors in third party libraries are the same types of errors that occur in application code. Errors should be handled primarily according to which error type they represent, with relevant adjustments for library code.
Does it make sense to create a list of error codes? Or is that old fashioned these days?
Application code should provide a complete description of the error in the case of programming errors, and a succinct description in the case of errors that can occur in normal operation; in either case, a description is normally more appropriate than an error code. Libraries may provide an error code as a way of describing whether an error is a programming or other internal error, or whether the error is one which can occur in normal operation, with the latter type perhaps subdivided more finely; however, an exception hierarchy can be more useful than an error code in languages where such is possible. Note that applications run from the command line may act as libraries for shell scripts, however.
Disclaimer: I do not know any theory on error handling. I have, however, thought repeatedly about the subject as I explored various languages and programming paradigms, and as I toyed with (and discussed) programming language designs. What follows, thus, is a summary of my experience so far, with objective arguments.
Note: this should cover all the questions, but I did not even try to address them in order, preferring instead a structured presentation. At the end of each section, I present a succinct answer to those questions it answered, for clarity.
Introduction
As a premise, I would like to note that, whatever the subject under discussion, some parameters must be kept in mind when designing a library (or reusable code).
The author cannot hope to fathom how this library will be used, and should thus avoid strategies that make integration more difficult than it should be. The most glaring defect would be relying on globally shared state; thread-local shared state can also be a nightmare for interactions with coroutines/green-threads. The use of such coroutines and threads also highlights that synchronization is best left to the user: in single-threaded code it will mean none (best performance), whilst with coroutines and green-threads the user is best placed to implement (or use existing implementations of) dedicated synchronization mechanisms.
That being said, when libraries are for internal use only, global or thread-local variables might be convenient; if used, they should be clearly documented as a technical limitation.
Logging
There are many ways to log messages:
with extra information such as timestamp, process-ID, thread-ID, server name/IP, ...
via synchronous calls or with an asynchronous mechanism (and an overflow handling mechanism)
in files, databases, distributed databases, dedicated log-servers, ...
As the author of a library, you should make sure the logs can be integrated within the client infrastructure (or turned off). This is best done by letting the client provide hooks to deal with the logs themselves. My recommendation is:
to provide 2 hooks: one to decide whether to log or not, and one to actually log (the message being formatted and the latter hook called only when the client decided to log)
to provide, on top of the message: a severity (aka level), the filename, line and function name if open-source or otherwise the logical module (if several)
to, by default, write to stdout and stderr (depending on severity), until the client explicitly says not to log
I would note that, following the guidelines delineated in the introduction, synchronization is left to the client.
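The two-hook recommendation above can be sketched in C++ roughly as follows. Everything here (the mylib namespace, LogHooks, Severity, log) is a hypothetical name invented for the example. The hooks are passed in explicitly rather than stored in a global, in keeping with the introduction, and the message is built lazily so that formatting only happens once the client has agreed to log.

```cpp
#include <functional>
#include <iostream>
#include <string>

namespace mylib {  // hypothetical library name

enum class Severity { Debug, Info, Warning, Error };

struct LogHooks {
    // Hook 1: should a message of this severity be logged at all?
    std::function<bool(Severity)> enabled = [](Severity) { return true; };

    // Hook 2: actually emit the message.
    // Default: stderr for warnings and errors, stdout for everything else.
    std::function<void(Severity, const char* file, int line, const std::string&)> write =
        [](Severity s, const char* file, int line, const std::string& msg) {
            std::ostream& out = (s >= Severity::Warning) ? std::cerr : std::cout;
            out << file << ':' << line << ": " << msg << '\n';
        };
};

// make_message is only invoked if the first hook says yes,
// so formatting costs nothing when logging is filtered out.
template <typename MakeMessage>
void log(const LogHooks& hooks, Severity s, const char* file, int line,
         MakeMessage&& make_message) {
    if (hooks.enabled && hooks.enabled(s) && hooks.write) {
        hooks.write(s, file, line, make_message());
    }
}

}  // namespace mylib
```

A client that wants silence can simply replace enabled with a function that always returns false. No locking is done here; as noted above, synchronization is the client's business.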
Regarding whether to log errors: do not log (as errors) what you already report via your API; you can, however, still log the details at a lower severity. The client can decide whether to report or not when handling the error, and may for example choose not to report it if this was just a speculative call.
Note: some information should not make it into the logs and some other pieces are best obfuscated. For example, passwords should not be logged, and Credit-Card or Passport/Social Security Numbers are best obfuscated (partly at least). In a library designed for such sensitive information, this can be done during logging; otherwise the application should take care of this.
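As an example of partial obfuscation, a card number could be masked before it reaches the log. This is only a sketch: mask_card_number is a hypothetical helper, and keeping the last four digits is an assumption about what is acceptable in your domain, not a compliance recommendation.

```cpp
#include <string>

// Hypothetical helper: keeps the last four digits of a card number readable
// and replaces every earlier digit with '*'. Non-digit characters such as
// spaces and dashes are left untouched.
std::string mask_card_number(const std::string& number) {
    std::string masked = number;
    int digits_seen = 0;
    // Walk from the end so the last four digits are the ones preserved.
    for (auto it = masked.rbegin(); it != masked.rend(); ++it) {
        if (*it >= '0' && *it <= '9') {
            if (++digits_seen > 4) {
                *it = '*';
            }
        }
    }
    return masked;
}
```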
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
Application code should decide the policy. Whether a library logs or not depends on whether it needs to.
Going on after an error ?
Before we actually talk about reporting errors, the first question we should ask is whether the error should be reported (for handling) or if things are so wrong that aborting the current process is clearly the best policy.
This is certainly a tricky topic. In general, I would advise designing so that going on is an option, with a purge/reset if necessary. If this cannot be achieved in certain cases, then those cases should abort the process.
Note: on some systems, it is possible to get a memory-dump of the process. If an application handles sensitive data (password, credit-cards, passports, ...), it is best deactivated in production (but can be used during development).
Note: it can be useful to have a debug switch that turns a portion of the error-reporting calls into aborts with a memory-dump, to assist debugging during development.
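Such a debug switch could be as simple as a compile-time macro. In this sketch, MYLIB_ABORT_ON_ERROR and report_error are hypothetical names; whether std::abort actually leaves a core dump depends on how the operating system is configured.

```cpp
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical reporting function.
// Built with -DMYLIB_ABORT_ON_ERROR (development), it stops the process on
// the spot so that a memory dump captures the state at the point of failure.
// Built without it (production), it reports the error as an exception.
[[noreturn]] void report_error(const std::string& message) {
#ifdef MYLIB_ABORT_ON_ERROR
    std::cerr << "fatal: " << message << '\n';
    std::abort();
#else
    throw std::runtime_error(message);
#endif
}
```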
Reporting an error
The occurrence of an error signifies that the contract of a function/interface could not be fulfilled. This has several consequences:
the client should be warned, which is why the error should be reported
no partially correct data should escape in the wild
The latter point will be treated later on; for now let us focus on reporting the error. The client should not, ever, be able to accidentally ignore this report. Which is why using error-codes is such an abomination (in languages where return values can be ignored):
ErrorStatus_t doit(Input const* input, Output* output);
I know of two schemes that require explicit action on the client part:
exceptions
result types (optional<T>, either<T, U>, ...)
The former is well-known, the latter is very much used in functional languages and was introduced in C++11 under the guise of std::future<T> though other implementations exist.
I advise preferring the latter, when possible, as it is easier to fathom, but reverting to exceptions when no result is expected. Contrast:
Option<Value&> find(Key const&);
void updateName(Client::Id id, Client::Name name);
In the case of "write-only" operations such as updateName, the client has no use for a result. It could be introduced, but it would be easy to forget the check.
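To make the contrast concrete, here is a small sketch using std::optional from C++17. The Directory class and its member names are invented for the example. Since std::optional cannot hold a reference in C++17, the lookup returns a copy of the value rather than the Option<Value&> shown above.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical client directory, for illustration only.
class Directory {
public:
    void add(int id, std::string name) { names_[id] = std::move(name); }

    // Read operation: absence is part of the return type, so the caller
    // has to unwrap the optional before getting at a name.
    std::optional<std::string> find_name(int id) const {
        auto it = names_.find(id);
        if (it == names_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Write-only operation: there is no result to carry an error,
    // so failure is reported with an exception.
    void update_name(int id, std::string name) {
        auto it = names_.find(id);
        if (it == names_.end()) {
            throw std::out_of_range("unknown client id");
        }
        it->second = std::move(name);
    }

private:
    std::map<int, std::string> names_;
};
```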
Reverting to exceptions also occurs when a result type is impractical, or insufficient to convey the details:
Option<Value&> compute(RepositoryInterface&, Details...);
In such a case of externally defined callback, there is an almost infinite list of potential failures. The implementation could use the network, a database, the filesystem, ... in this case, and in order to report errors accurately:
the externally defined callback should be expected to report errors via exceptions when the interface is insufficient (or impractical) to convey the full details of the error.
the functions based on this abstract callback should be transparent to those exceptions (let them pass, unmodified)
The goal is to let this exception bubble up to the layer where the implementation of the interface was decided (at least), for it's only at this level that there is a chance to correctly interpret the exception thrown.
Note: the externally defined callback is not forced to use exceptions, we should just expect it might be using some.
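A minimal sketch of this "transparent" style follows. The RepositoryInterface here is a simplified stand-in with a single hypothetical load method, and NetworkDown and FlakyRepository are made up to play the role of an application-side implementation.

```cpp
#include <stdexcept>

// Hypothetical interface, implemented by the application rather than the library.
struct RepositoryInterface {
    virtual ~RepositoryInterface() = default;
    virtual int load(int key) = 0;  // may throw whatever the implementation needs
};

// Library function built on the callback. It deliberately catches nothing:
// implementation-specific exceptions pass through unmodified to the layer
// that chose the implementation and knows how to interpret them.
int compute(RepositoryInterface& repo, int key) {
    return repo.load(key) * 2;
}

// Application side: an implementation with its own failure mode.
struct NetworkDown : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct FlakyRepository : RepositoryInterface {
    int load(int) override { throw NetworkDown("connection lost"); }
};
```

The code that constructs the FlakyRepository is the only place that knows NetworkDown exists, so that is where a catch for it belongs.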
Using an error
In order to use an error report, the client needs enough information to make a decision. Structured information, such as error codes or exception types, should be preferred (for automatic actions) and additional information (message, stack, ...) can be provided in a non-structured way (for humans to investigate).
It would be best if a function clearly documented all possible failure modes: when they occur and how they are reported. However, especially in case arbitrary code is executed, the client should be prepared to deal with unknown codes/exceptions.
A notable exception is, of course, result types: boost::variant<Output, Error0, Error1, ...> provides a compiler-checked exhaustive list of known failure modes... though a function returning this type could still throw, of course.
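The same idea can be sketched with std::variant from C++17 instead of boost::variant. The types ParseOk, EmptyInput and NotANumber and the function parse_digits are invented for the example, and integer overflow is ignored for brevity. If one of the lambdas passed to std::visit is removed, the code no longer compiles, which is the compiler-checked exhaustiveness mentioned above.

```cpp
#include <string>
#include <variant>

// Hypothetical success and failure types for a tiny parser.
struct ParseOk { int value; };
struct EmptyInput {};
struct NotANumber { char offending; };

using ParseResult = std::variant<ParseOk, EmptyInput, NotANumber>;

ParseResult parse_digits(const std::string& text) {
    if (text.empty()) return EmptyInput{};
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return NotANumber{c};
        value = value * 10 + (c - '0');  // overflow not handled in this sketch
    }
    return ParseOk{value};
}

// Common helper for building a visitor out of lambdas.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Every alternative must be handled, or std::visit fails to compile.
std::string describe(const ParseResult& r) {
    return std::visit(overloaded{
        [](const ParseOk& ok) { return "ok: " + std::to_string(ok.value); },
        [](const EmptyInput&) { return std::string("error: empty input"); },
        [](const NotANumber& e) { return std::string("error: unexpected '") + e.offending + "'"; },
    }, r);
}
```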
How to decide between logging an error, or showing it as an error message to the user?
The user should always be warned when their request could not be fulfilled; however, a user-friendly (understandable) message should be displayed. If possible, advice or work-arounds should be presented as well. Details are for investigating teams.
Recovering from an error ?
Last, but certainly not least, comes the truly frightening part about errors: recovery.
This is something that databases (real ones) are so good for: transaction-like semantics. If anything unexpected occurs, the transaction is aborted as if nothing had happened.
In the real world, things are not so simple. The example of cancelling an e-mail that has already been sent comes to mind: too late. Protocols may exist, depending on your application domain, but this is outside the scope of this discussion. The first step, though, is the ability to recover a sane in-memory state; and that is far from being simple in most languages (and STM can only do so much today).
First of all, an illustration of the challenge:
void update(Client& client, Client::Name name, Client::Address address) {
    client.update(std::move(name));
    client.update(std::move(address)); // Throws
}
Now, after updating the address failed, I am left with a half-updated client. What can I do ?
attempting to undo all the updates that occurred is close to impossible (the undo might fail)
copying the state prior to executing any single update is a performance hog (supposing we can even swap it back in a sure way)
In any case, the book-keeping required is such that mistakes will creep in.
And worst of all: there is no safe assumption that can be made as to the extent of the corruption (except that client is now botched). Or at least, no assumption that will endure time (and code changes).
As often, the only way to win is not to play.
A possible solution: Transactions
Wherever possible, the key idea is to define macro functions, that will either fail or produce the expected result. Those are our transactions. And their form is invariant:
Either<Output, Error> doit(Input const&);
// or
Output doit(Input const&); // throw in case of error
A transaction does not modify any external state, thus if it fails to produce a result:
the external world has not changed (nothing to rollback)
there is no partial result to observe
Any function that is not a transaction should be considered as having corrupted anything it touched, and thus the only sane way of dealing with an error from non-transactional functions is to let it bubble up until a transaction layer is reached. Any attempt to deal with the error prior is, in the end, doomed to fail.
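Applied to the earlier client update, a transaction in this sense might look like the sketch below. Client is reduced to two strings, and validated_address and with_updates are hypothetical names. The function works on a copy and only hands back a complete new value; the copy does cost something, as noted earlier, so this trades some performance for a state that is never half-updated.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for the Client type used above.
struct Client {
    std::string name;
    std::string address;
};

// Hypothetical validation step that may throw.
std::string validated_address(std::string address) {
    if (address.empty()) {
        throw std::invalid_argument("address must not be empty");
    }
    return address;
}

// Transaction: reads its input through a const reference, builds a complete
// new value, and either returns it or throws. The caller's Client is never
// touched, so a failure leaves nothing to roll back.
Client with_updates(const Client& current, std::string name, std::string address) {
    Client next = current;
    next.name = std::move(name);
    next.address = validated_address(std::move(address));  // may throw
    return next;
}
```

The caller commits with a plain assignment, client = with_updates(client, name, address), which only runs if the whole transaction succeeded.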
How to decide if an error should be handled locally or propagated to higher level code ?
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Deal with them whenever it is safe to do so and there is value in doing so. Most notably, it's okay to catch an error, check if it can be dealt with locally, and then either deal with it or pass it up.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
I did not address this question previously; however, I believe it is clear that the approach I highlighted is already dual, since it consists of both result-types and exceptions. As such, dealing with 3rd party libraries should be a cinch, though I do advise wrapping them anyway for other reasons (3rd party code is better insulated behind a business-oriented interface tasked with the impedance adaptation).
My view on logging (or other actions) from library code is NEVER.
A library should not impose policy on its user, and the user may have INTENDED an error to occur. Perhaps the program was deliberately soliciting a particular error, in the expectation of it arriving, to test some condition. Logging this error would be misleading.
Logging (or anything else) imposes policy on the caller, which is bad. Moreover, if a harmless error condition (which would be ignored or retried harmlessly by the caller, for example) were to happen with a high frequency, the volume of logs could mask any legitimate errors or cause robustness problems (filling discs, using excessive IO, etc.).
How to decide if an error should be handled locally or propagated to higher level code?
If the exception breaks the operation of a method it is a good approach to throw it to higher level. If you are familiar with MVC, Exceptions must be evaluated in Controller.
How to decide between logging an error, or showing it as an error message to the user?
Logging errors and all information available about the error is a good approach. If the error breaks the operation, or the user needs to know that an error has occurred, you should display it to the user. Note that in a Windows service logs are very, very important.
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
I don't see any reason to log errors in a dll. It should only throw errors. There may be a specific reason to do so, of course. In our company a dll logs information about the process (not only errors).
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Similar question: at what point should you stop propagating an error and deal with it?
In a controller.
Edit: I need to explain this a bit if you are not familiar with MVC. Model View Controller is a design pattern. In Model you develop application logic. In View you display content to user. In Controller you get user events and call Model for relevant function then invoke View to display result to the user.
Suppose that you have a form which has two textboxes, a label, and a button named Add. As you might guess, this is your View. The Button_Click event is defined in the Controller, and an add method is defined in the Model. When the user clicks, the Button_Click event is triggered and the Controller calls the add method. Here the textbox values can be empty, or they can be letters instead of numbers. An exception occurs in the add function and is thrown. The Controller handles it and displays an error message in the label.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
I prefer the second one. It would be easier. And I don't think you can build something general for error handling, especially across different libraries.
Does it make sense to create a list of error codes? Or is that old fashioned these days?
That depends on how you will use it. In a single application (a web site, a desktop application), I don't think it is needed. But if you develop a web service, how will you inform users of errors? Providing an error code is always important here.
if (error.Message == "User Login Failed")
{
    // do something.
}
if (error.Code == "102")
{
    // do something.
}
Which one do you prefer?
And there is another way for error codes these days:
if (error.Code == "LOGIN_ERROR_102") // wrong password
{
    // do something.
}
The others may be: LOGIN_ERROR_103 (eg: this is user expired) etc...
This one is also human readable.
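In C++ the same idea could be sketched with a scoped enum on the inside and the readable string on the wire. LoginError and to_code_string are hypothetical names; the numeric values and meanings are the ones used in the examples above.

```cpp
#include <string>

// Hypothetical error codes, with the numbering from the examples above.
enum class LoginError {
    WrongPassword = 102,
    UserExpired = 103,
};

// Stable, human-readable identifier to send to clients of the web service.
std::string to_code_string(LoginError e) {
    switch (e) {
        case LoginError::WrongPassword: return "LOGIN_ERROR_102";
        case LoginError::UserExpired:   return "LOGIN_ERROR_103";
    }
    return "LOGIN_ERROR_UNKNOWN";  // value outside the known enumerators
}
```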
Here is an awesome blog post which explains how error handling should be done. http://damienkatz.net/2006/04/error_code_vs_e.html
How to decide if an error should be handled locally or propagated to higher level code?
Like Martin Becket says in another answer, this is a question of whether the error can be fixed here or not.
How to decide between logging an error, or showing it as an error message to the user?
You should probably never show a raw error to the user. Rather, show them a well-formed message explaining the situation, without giving too much technical information. Then log the technical information, especially if it is an error while processing input. If your code doesn't know how to handle faulty input, then that MUST be fixed.
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
Logging in library code is not useful, because you may not even have written it. However, the application could log interaction with the library code and even through statistics detect errors.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
See question one.
Similar question: at what point should you stop propagating an error and deal with it?
See question one.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
Throwing exceptions is an expensive operation in most heavy languages, so use them where the entire program flow is broken for that operation. On the other hand, if you can predict all outcomes of a function, put any data through a referenced variable passed as parameter to it, and return an error code (0 on success, 1+ on errors).
Does it make sense to create a list of error codes? Or is that old fashioned these days?
Make a list of error codes for a particular function, and document it inside it as a list of possible return values. See previous question as well as the link.
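A sketch of that style in C++: the result goes through a reference parameter, the return value is the error code, and the list of codes is documented on the function. parse_port is a made-up example. The [[nodiscard]] attribute (C++17) is an addition of mine; it makes most compilers warn when the returned code is silently ignored.

```cpp
#include <string>

// Hypothetical example function.
// Possible return values:
//   0  success, port has been set
//   1  text is empty
//   2  text contains a non-digit character
//   3  value is outside the range 1..65535
// On any non-zero return, port is left unchanged.
[[nodiscard]] int parse_port(const std::string& text, int& port) {
    if (text.empty()) return 1;
    long value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return 2;
        value = value * 10 + (c - '0');
        if (value > 65535) return 3;  // also keeps value from overflowing
    }
    if (value == 0) return 3;
    port = static_cast<int>(value);
    return 0;
}
```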
Always handle as soon as possible. The closer you are to its occurrence, the more chance you have to do something meaningful, or at the least to figure out where and why it happened. In C++ this is not just a matter of context: in many cases the cause is impossible to determine later.
In general you should always halt the app if something buggy occurs that is a real error (not something like not finding a file, which is not really something that should count as an error but is labeled as such). It's not going to just sort itself out, and once the app is broken it will cause errors that are impossible to debug because they have nothing to do with the area they occur.
Why not?
see 1.
see 1.
You need to keep things simple, or you will regret it. More important than handling bugs at runtime is testing to avoid them.
It's like saying is it better to centralize or not centralize. It might make a lot of sense in some cases but be a waste of time in others. For something that is a loadable lib/module of some kind that can have errors that are data related (garbage in, garbage out), it makes tons of sense. For more general error handling or catastrophic errors, less.
Error handling is not accompanied by formal theory. It is too 'implementation specific' a topic to be considered a field of science (to be fair, there is a great debate over whether programming is a science in its own right).
Nonetheless, it is a good part of a developer's work (and thus of his or her life), so practical approaches and technical guidelines have been developed on the topic.
A good view on the topic is presented by A. Alexandrescu in his talk "Systematic Error Handling in C++".
I have a repository in GitHub where the techniques presented are implemented.
Basically, what A.A. does is implement a class
template<class T>
class Expected { /* Implementation in the GitHub link */ };
that is meant to be used as a return value. This class can hold either a return value of type T or an exception (pointer). The exception can be thrown either explicitly or upon request, yet the rich error information is always available. An example usage would be like this
Expected<int> foo();
// ....
Expected<int> ret = foo();
if (ret.valid()) {
    // do the work
}
else {
    // either use the info of the exception
    // or throw the exception (eg in an exception "friendly" codebase)
}
While building this framework for error handling, A.A. walks us through techniques and designs that produce successful or poor error handling, and what works and what does not. He also gives his definitions of 'error' and 'error handling'.
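For readers who just want the general shape, here is a heavily reduced sketch of such a class. It is not the implementation from the talk or from the repository: it is built on std::variant and std::exception_ptr (C++17), the names from_exception and get are my own choices, and it does not handle the corner case where T is itself std::exception_ptr.

```cpp
#include <exception>
#include <stdexcept>
#include <utility>
#include <variant>

// Reduced sketch: holds either a T or a captured exception.
template <class T>
class Expected {
public:
    // Success case: implicit, so a function can simply "return value;".
    Expected(T value) : state_(std::in_place_index<0>, std::move(value)) {}

    // Failure case: capture an exception object without throwing it.
    template <class E>
    static Expected from_exception(E e) {
        return Expected(std::make_exception_ptr(std::move(e)));
    }

    bool valid() const { return state_.index() == 0; }

    // Returns the value, or throws the stored exception on request.
    T& get() {
        if (!valid()) {
            std::rethrow_exception(std::get<1>(state_));
        }
        return std::get<0>(state_);
    }

private:
    explicit Expected(std::exception_ptr e)
        : state_(std::in_place_index<1>, std::move(e)) {}

    std::variant<T, std::exception_ptr> state_;
};

// Hypothetical function using it.
Expected<int> parse_positive(int raw) {
    if (raw <= 0) {
        return Expected<int>::from_exception(std::invalid_argument("not positive"));
    }
    return raw;
}
```

A caller can test valid() and carry on without any exception being thrown, or call get() directly and let the stored exception surface in an exception-friendly codebase.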
My two cents.
How to decide if an error should be handled locally or propagated to higher level code?
Handle errors you can handle. Let errors propagate that you cannot.
How to decide between logging an error, or showing it as an error message to the user?
Two orthogonal issues, which are not mutually exclusive. Logging the error is ultimately for you, the developer. If you would be interested in it, log it. Show it to the user if it is actionable to the user ("Error: No network connection!").
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
I see no reason why libraries can't log.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
You should catch them where you can handle them (insert your definition of handle). If you can't handle them, ignore them (maybe someone up the stack can handle them..).
You certainly shouldn't put a try/catch block around each and every throwing function you call.
Similar question: at what point should you stop propagating an error and deal with it?
At the first point that you can actually deal with it. That point may not exist, and your app may crash. Then you'll get a nice crash dump, and can update your error handling.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
Does it make sense to create a list of error codes? Or is that old fashioned these days?
Another point of contention. I'd actually say no: one super list of all error codes implies that that list is always up to date, so you can actually do harm when it's not up to date. It's better to have each function document all the error codes it can return, rather than have one super list.
How to decide if an error should be handled locally or propagated to higher level code?
Error handling should be done at the highest affected level. If it only impacts the lower level code, then it should be handled there. If the error affects higher level code, then the error needs to be handled at the higher level. This is to prevent some higher level code from going on its merry way after an error has caused its actions to be incorrect. It should know what is going on, provided it is impacted.
How to decide between logging an error, or showing it as an error message to the user?
You should always log the error. You should show the error to the user when they are affected by it. If it is something they will never notice and that has no direct impact on them (e.g. two sockets failed to open before the third finally opened, resulting in a very short delay for the user), then they should not be notified.
Is logging something that should only be done in application code? Or is it ok to do some logging from library code.
Too much logging is rarely a bad thing. You will regret not logging things when you have to hunt down a library bug more than you will be frustrated by extra logs when hunting down other bugs.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Similar to error handling above, it should be caught where the impact is, and where the error can be corrected/handled effectively. This will vary from case to case.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).
This is largely a personal decision. My internal error handling is much different than the error handling I use for anything that touches a third party library. I have a general idea of what to expect from my code, but the third party stuff could have anything happen to it.
Does it make sense to create a list of error codes? Or is that old fashioned these days?
Depends how much you expect to have errors thrown. You might love your list of error codes if you spend a lot of time bug hunting, as they can help point you in the right direction. However, any time spent building these is less time spent on coding/bug fixing, so it's a mixed bag. This largely comes down to personal preference.
The first question is probably what can you do about the error?
Can you fix it (in which case do you need to tell the user) or can the user fix it?
If nobody can fix it and you are going to exit, is there any value in having this reported back to you (through a crash dump or error code)?
I'm changing my design and coding philosophy so that:
If all runs smoothly, as expected, no errors generated.
Throw an exception if something different, or unexpected happens; let the caller handle it.
If it can't be resolved, propagate it up a higher level.
Hopefully, with this technique, the issues that get propagated to the User will be very important; otherwise the program tries to resolve them.
I'm currently experiencing issues that get lost in the return codes; or new return codes are created.
The book "Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries" by Krzysztof Cwalina and Brad Abrams has some good suggestions on this. See chapter 7 on Exceptions. For example, it favours throwing exceptions over returning error codes.
-Krip