Floating point exceptions in Linux -- how to convert them into something useful (C++)

I've got some piece of code calculating a vector of functions of a vector of independent variables (and parameters):
struct instance
{
    double m_adParam1, m_adParam2;
    std::array<double, OUTSIZE> calculate(const std::array<double, INSIZE>& x) const;
};
When any part of this code fails due to a division by (nearly) zero or due to a domain error, the entire calculation should fail and tell the caller that it cannot calculate this function for the given input.
This should also apply to calculated values that are used for comparison only, so testing the result vector for NaN or infinity is not sufficient. Inserting tests after every fallible operation is infeasible.
How could one do this?
I read about fetestexcept(). But it is unclear what is resetting this state flag.
SIGFPE is useless for anything other than debugging.
Having the ability to convert this into C++ exceptions as on Windows would be great, but inserting a final check would be sufficient as well.
Inserting code to check for invalid inputs for the following code is also infeasible, as it would slow down the calculation dramatically.

I read about fetestexcept(). But it is unclear what is resetting this state flag.
That is what feclearexcept() does. So you can do something like
std::feclearexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO);
// do lots of calculations
// now check for exceptions
if (std::fetestexcept(FE_INVALID))
    throw std::domain_error("argh");
else if (std::fetestexcept(FE_OVERFLOW))
    throw std::overflow_error("eek");
// ...
The exception flags are "sticky" and no other operation, besides an explicit feclearexcept(), should clear them.
The main benefits of using feenableexcept() to force a SIGFPE are that you would detect the error immediately, without waiting until you happen to get around to testing the exception flags, and that you could pinpoint the exact instruction that caused the exception. It doesn't sound like either of those is needed in your use case.
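Applied to the struct from the question, a minimal sketch of that final check could look like this (the wrapper name calculate_or_throw is my own, and a #pragma STDC FENV_ACCESS ON may be needed for the flag tests to be reliable under optimization):
#include <array>
#include <cfenv>
#include <stdexcept>

// Hypothetical free function wrapping the question's instance::calculate();
// INSIZE and OUTSIZE are assumed to be defined as in the question.
std::array<double, OUTSIZE>
calculate_or_throw(const instance& inst, const std::array<double, INSIZE>& x)
{
    std::feclearexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO);
    auto result = inst.calculate(x);   // all the fallible arithmetic happens in here
    if (std::fetestexcept(FE_INVALID | FE_OVERFLOW | FE_DIVBYZERO))
        throw std::domain_error("cannot calculate this function for the given input");
    return result;
}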

Related

Signaling or catching 'nan' as they occur in computations in numerical code base in c++

We have numerical code written in C++. Rarely, but under certain specific inputs, some of the computations result in a 'nan' value.
Is there a standard or recommended method by which we can stop and alert the user when a certain numerical calculation results in a 'nan' being generated (under debug mode)? Checking each result for 'nan' seems impractical given the huge sizes of the matrices and vectors.
How do standard numerical libraries handle this situation? Could you throw some light on this?
NaN propagates through numeric operations, so it is enough to check the final result for being a NaN. As for how to do it: if building for C++11 or later, there is std::isnan, as Goz noticed. For earlier standards, if you want to be bulletproof, I would personally do bit-checking (especially if optimizations may be involved). The bit pattern for a NaN is
?   11.......1   xx.......x
sign  exponent    fraction
where ? may be anything, and at least one x must be 1.
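A minimal sketch of that bit-checking approach for a double, assuming the IEEE-754 binary64 layout (the function name is mine):
#include <cstdint>
#include <cstring>

// True if d has an all-ones exponent and a non-zero fraction, i.e. the NaN pattern above.
bool is_nan_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);            // portable way to inspect the bits
    const std::uint64_t exponent = (bits >> 52) & 0x7FFu;
    const std::uint64_t fraction = bits & 0xFFFFFFFFFFFFFull;
    return exponent == 0x7FFu && fraction != 0;
}
Because it never performs a floating-point comparison, this also sidesteps the f != f / -ffast-math issue mentioned further down.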
For a platform-dependent solution, there seems to be yet another possibility. There is the function feenableexcept in glibc (probably combined with the signal function and the compiler option -fnon-call-exceptions), which turns on generation of SIGFPE signals when an invalid floating point operation occurs. And there is the function _control87 (probably combined with the _set_se_translator function and the compiler option /EHa), which allows pretty much the same in VC.
Although this is a nonstandard extension originally from glibc, on many systems you can use the feenableexcept routine declared in <fenv.h> to request that the machine trap particular floating-point exceptions and deliver SIGFPE to your process. You can use fedisableexcept to mask trapping, and fegetexcept to query the set of exceptions that are unmasked. By default they are all masked.
On older BSD systems without these routines, you can use fpsetmask and fpgetmask from <ieeefp.h> instead, but the world seems to be converging on the glibc API.
Warning: glibc currently has a bug whereby (the C99 standard routine) fegetenv has the unintended side effect of masking all exception traps on x86, so you have to call fesetenv to restore them afterward. (Shows you how heavily anyone relies on this stuff...)
On many architectures, you can unmask the invalid exception, which will cause an interrupt when a NaN would ordinarily be generated by a computation such as 0*infinity. Running in the debugger, you will break on this interrupt and can examine the computation that led to that point. Outside of a debugger, you can install a trap handler to log information about the state of the computation that produced the invalid operation.
On x86, for example, you would clear the Invalid Operation Mask bit in FPCR (bit 0) and MXCSR (bit 7) to enable trapping for invalid operations from x87 and SSE operations, respectively.
Some individual platforms provide a means to write to these control registers from C, but there's no portable interface that works cross-platform.
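A hedged sketch of that approach on Linux/glibc (the handler name and message are mine; feenableexcept may require _GNU_SOURCE, and -fnon-call-exceptions is only needed if you intend to throw from the handler):
#include <fenv.h>     // feenableexcept (glibc extension)
#include <signal.h>
#include <unistd.h>
#include <cstdlib>

extern "C" void fpe_handler(int, siginfo_t* info, void*)
{
    // Stay async-signal-safe: write() a fixed message and terminate.
    const char msg[] = "SIGFPE trapped (invalid/div-by-zero/overflow)\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)info;   // info->si_code distinguishes FPE_FLTINV, FPE_FLTDIV, FPE_FLTOVF, ...
    _exit(EXIT_FAILURE);
}

int main()
{
    struct sigaction sa {};
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGFPE, &sa, nullptr);

    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);   // unmask these traps

    volatile double zero = 0.0;
    volatile double x = 1.0 / zero;   // delivers SIGFPE instead of quietly producing inf
    (void)x;
}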
Testing f!=f might give problems using g++ with -ffast-math optimization enabled: Checking if a double (or float) is NaN in C++
The only foolproof way is to check the bit pattern.
As to where to implement the checks, this really depends on the specifics of your calculation and how frequent NaN errors are, i.e. the performance penalty of continuing tainted calculations versus checking at certain stages.

Handling Floating-Point exceptions in C++

I'm finding the floating-point model/error issues quite confusing. It's an area I'm not familiar with and I'm not a low level C/asm programmer, so I would appreciate a bit of advice.
I have a largish C++ application built with VS2012 (VC11) that I have configured to throw floating-point exceptions (or more precisely, to allow the C++ runtime and/or hardware to throw fp-exceptions) - and it is throwing quite a lot of them in the release (optimized) build, but not in the debug build. I assume this is due to the optimizations and perhaps the floating-point model (although the compiler /fp:precise switch is set for both the release and debug builds).
My first question relates to managing the debugging of the app. I want to control where fp-exceptions are thrown and where they are "masked". This is needed because I am debugging the (optimized) release build (which is where the fp-exceptions occur) - and I want to disable fp-exceptions in certain functions where I have detected problems, so I can then locate new FP problems. But I am confused by the difference between using _controlfp_s to do this (which works fine) and the compiler (and #pragma float_control) switch "/fp:except" (which seems to have no effect). What is the difference between these two mechanisms? Are they supposed to have the same effect on fp exceptions?
Secondly, I am getting a number of "Floating-point stack check" exceptions - including one that seems to be thrown in a call to the GDI+ dll. Searching around the web, the few mentions of this exception seem to indicate it is due to compiler bugs. Is this generally the case? If so, how should I work round this? Is it best to disable compiler optimizations for the problem functions, or to disable fp-exceptions just for the problematic areas of code if there don't appear to be any bad floating-point values returned? For example, in the GDI+ call (to GraphicsPath::GetPointCount) that throws this exception, the actual returned integer value seems correct. Currently I'm using _controlfp_s to disable fp-exceptions immediately prior to the GDI+ call – and then use it again to re-enable exceptions directly after the call.
Finally, my application does make a lot of floating-point calculations and needs to be robust and reliable, but not necessarily hugely accurate. The nature of the application is that the floating-point values generally indicate probabilities, so are inherently somewhat imprecise. However, I want to trap any pure logic errors, such as divide by zero. What is the best fp model for this? Currently I am:
trapping all fp exceptions (i.e. EM_OVERFLOW | EM_UNDERFLOW | EM_ZERODIVIDE | EM_DENORMAL | EM_INVALID) using _controlfp_s and a SIGFPE Signal handler,
have enabled the denormals-are-zero (DAZ) and flush-to-zero (FTZ) (i.e. _MM_SET_FLUSH_ZERO_MODE(_MM_DENORMALS_ZERO_ON)), and
I am using the default VC11 compiler settings /fp:precise with /fp:except not specified.
Is this the best model?
Thanks and regards!
Most of the following information comes from Bruce Dawson's blog post on the subject (link).
Since you're working with C++, you can create a RAII class that enables or disables floating point exceptions in a scoped manner. This gives you greater control, so that you're only exposing the exception state to your own code rather than manually managing calls to _controlfp_s() yourself. In addition, the floating point control state set this way is not scoped to your code (it persists for the rest of the thread), so it's really advisable to remember the previous state of the control word and restore it when needed. RAII can take care of this for you and is a good solution for the issues with GDI+ that you're describing.
The exception flags _EM_OVERFLOW, _EM_ZERODIVIDE, and _EM_INVALID are the most important to account for. _EM_OVERFLOW is raised when positive or negative infinity is the result of a calculation, whereas _EM_INVALID is raised when a result is a signaling NaN. _EM_UNDERFLOW is safe to ignore; it signals when your computation result is non-zero and between -FLT_MIN and FLT_MIN (in other words, when you generate a denormal). _EM_INEXACT is raised too frequently to be of any practical use due to the nature of floating point arithmetic, although it can be informative if trying to track down imprecise results in some situations.
SIMD code adds more wrinkles to the mix; since you don't indicate using SIMD explicitly I'll leave out a discussion of that except to note that specifying anything other than /fp:fast can disable automatic vectorization of your code in VS 2012; see this answer for details on this.
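One possible shape for such a RAII helper, sketched against the MSVC _controlfp_s()/_clearfp() API (the class name and defaults are mine, loosely following the blog post's idea, not a verbatim copy of it):
#include <float.h>   // _controlfp_s, _clearfp, _MCW_EM, _EM_* (MSVC CRT)

// Unmasks the requested FP exceptions for the lifetime of the object and
// restores the previous masking on destruction, keeping the unmasked state
// as narrowly scoped as possible.
class FPExceptionEnabler
{
public:
    explicit FPExceptionEnabler(unsigned int enable =
                                    _EM_OVERFLOW | _EM_ZERODIVIDE | _EM_INVALID)
    {
        _controlfp_s(&saved_, 0, 0);       // remember the current control word
        _clearfp();                        // discard stale status flags first
        unsigned int unused;
        _controlfp_s(&unused, ~enable, _MCW_EM);   // cleared mask bit = trap enabled
    }
    ~FPExceptionEnabler()
    {
        _clearfp();
        unsigned int unused;
        _controlfp_s(&unused, saved_, _MCW_EM);    // restore the previous masking
    }
    FPExceptionEnabler(const FPExceptionEnabler&) = delete;
    FPExceptionEnabler& operator=(const FPExceptionEnabler&) = delete;
private:
    unsigned int saved_ = 0;
};
You would construct one of these only around the code you want checked, and simply omit it (or build a masking counterpart) around calls like the GDI+ one mentioned in the question.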
I can't help much with the first two questions, but I have experience and a suggestion for the question about masking FPU exceptions.
I've found the functions
_statusfp() (x64 and Win32)
_statusfp2() (Win32 only)
_fpreset()
_controlfp_s()
_clearfp()
_matherr()
useful when debugging FPU exceptions and in delivering a stable and fast product.
When debugging, I selectively unmask exceptions to help isolate the line of code where an fpu exception is generated in a calculation where I cannot avoid calling other code that unpredictably generates fpu exceptions (like the .NET JIT's divide by zeros).
In released product I use them to deliver a stable program that can tolerate serious floating point exceptions, detect when they occur, and recover gracefully.
I mask all FPU exceptions when I have to call code that cannot be changed, does not have reliable exception handling, and occasionally generates FPU exceptions.
Example:
#define BAD_FPU_EX (_EM_OVERFLOW | _EM_ZERODIVIDE | _EM_INVALID)
#define COMMON_FPU_EX (_EM_INEXACT | _EM_UNDERFLOW | _EM_DENORMAL)
#define ALL_FPU_EX (BAD_FPU_EX | COMMON_FPU_EX)
Release code:
_fpreset();
unsigned int cw = 0;
_controlfp_s(&cw, ALL_FPU_EX, _MCW_EM);   // mask ALL_FPU_EX
_clearfp();
// ... calculation
unsigned int bad_fpu_ex = (BAD_FPU_EX & _statusfp());
_clearfp();   // to prevent reacting to existing status flags again
if ( 0 != bad_fpu_ex )
{
    // ... use fallback calculation
    // ... discard result and return error code
    // ... throw exception with useful information
}
Debug code:
_fpreset();
_clearfp();
unsigned int cw = 0;
_controlfp_s(&cw, COMMON_FPU_EX, _MCW_EM);   // mask COMMON_FPU_EX, unmask BAD_FPU_EX
// ... calculation
"Crash" in the debugger on the line of code that is generating the "bad" exception.
Depending on your compiler options, release builds may be using intrinsic calls to FPU ops and debug builds may call math library functions. These two methods can have significantly different error handling behavior for invalid operations like sqrt(-1.0).
Using executables built with VS2010 on 64-bit Windows 7, I have generated slightly different double precision arithmetic values when using identical code on Win32 and x64 platforms, even in non-optimized debug builds with /fp:precise, the FPU precision control explicitly set to _PC_53, and the FPU rounding control explicitly set to _RC_NEAR. I had to adjust some regression tests that compare double precision values to take the platform into account. I don't know if this is still an issue with VS2012, but heads up.
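If you want to see which error-handling path a particular build actually takes for something like sqrt(-1.0), a small probe along these lines can help (not from the answer above; it assumes a C99/C++11-conforming <cmath> that provides math_errhandling):
#include <cmath>
#include <cfenv>
#include <cerrno>
#include <cstdio>

int main()
{
    errno = 0;
    std::feclearexcept(FE_ALL_EXCEPT);

    volatile double minus_one = -1.0;   // volatile so the call is not folded away
    double r = std::sqrt(minus_one);    // domain error: result is NaN

    std::printf("result=%f errno==EDOM:%d FE_INVALID:%d\n",
                r, errno == EDOM, std::fetestexcept(FE_INVALID) != 0);
    std::printf("MATH_ERRNO:%d MATH_ERREXCEPT:%d\n",
                (math_errhandling & MATH_ERRNO) != 0,
                (math_errhandling & MATH_ERREXCEPT) != 0);
}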
I've been struggling to find information about handling floating point exceptions on Linux, and I can tell you what I learned:
There are a few ways of enabling the exception (trapping) mechanism:
1. fesetenv(FE_NOMASK_ENV); // enables trapping for all exceptions
2. feenableexcept(FE_ALL_EXCEPT);
3. Via the x87 control word macros from <fpu_control.h>:
fpu_control_t fw;
_FPU_GETCW(fw);
fw &= ~FE_ALL_EXCEPT;   // clear the mask bits to unmask (trap) these exceptions
_FPU_SETCW(fw);
4. Via the floating point environment (the __control_word field comes from bits/fenv.h, pulled in by <fenv.h>):
fenv_t envp;
fegetenv(&envp);
envp.__control_word &= ~_FPU_MASK_OM;   // unmask the overflow exception
fesetenv(&envp);
5. Via inline assembly:
fpu_control_t cw;
__asm__ ("fnstcw %0" : "=m" (*&cw));    // get the control word
cw &= ~FE_UNDERFLOW;                    // unmask the underflow exception
__asm__ ("fldcw %0" : : "m" (*&cw));    // write the control word back
6. C++ mode: std::feclearexcept(FE_ALL_EXCEPT);
There are some useful links:
http://frs.web.cern.ch/frs/Source/MAC_headers/fpu_control.h
http://en.cppreference.com/w/cpp/numeric/fenv/fetestexcept
http://technopark02.blogspot.ro/2005/10/handling-sigfpe.html

(How) do you handle possible integer overflows in C++ code?

Every now and then, especially when doing 64bit builds of some code base, I notice that there are plenty of cases where integer overflows are possible. The most common case is that I do something like this:
// Creates a QPixmap out of some block of data; this function comes from library A
QPixmap createFromData( const char *data, unsigned int len );
const std::vector<char> buf = createScreenShot();
return createFromData( &buf[0], buf.size() ); // <-- warning here in 64bit builds
The thing is that std::vector::size() nicely returns a size_t (which is 8 bytes in 64bit builds) but the function happens to take an unsigned int (which is still only 4 bytes in 64bit builds). So the compiler warns correctly.
If possible, I try to fix up the signatures to use the correct types in the first place. However, I'm often hitting this problem when combining functions from different libraries which I cannot modify. Unfortunately, I often resort to some reasoning along the lines of "Okay, nobody will ever do a screenshot generating more than 4GB of data, so why bother" and just change the code to do
return createFromData( &buf[0], static_cast<unsigned int>( buf.size() ) );
So that the compiler shuts up. However, this feels really evil. So I've been considering to have some sort of runtime assertion which at least yields a nice error in the debug builds, as in:
assert( buf.size() <= std::numeric_limits<unsigned int>::max() );
This is a bit nicer already, but I wonder: how do you deal with this sort of problem, that is: integer overflows which are "almost" impossible (in practice). I guess that means that they don't occur for you, they don't occur for QA - but they explode in the face of the customer.
If you can't fix the types (because you can't break library compatibility), and you're "confident" that the size will never get that big, you can use boost::numeric_cast in place of the static_cast. This will throw an exception if the value is too big.
Of course the surrounding code then has to do something vaguely sensible with the exception - since it's a "not expected ever to happen" condition, that might just mean shutting down cleanly. Still better than continuing with the wrong size.
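For illustration, a sketch of the boost::numeric_cast route under that "shut down cleanly" policy (the header and exception names are the documented Boost.NumericConversion ones; the size value is made up and assumes a 64-bit build):
#include <boost/numeric/conversion/cast.hpp>
#include <cstddef>
#include <cstdio>

int main()
{
    std::size_t size = 5000000000ULL;   // pretend buf.size() returned ~5 GB
    try
    {
        // Throws boost::numeric::positive_overflow (derived from bad_numeric_cast)
        // when the value does not fit into the destination type.
        unsigned int len = boost::numeric_cast<unsigned int>(size);
        std::printf("len=%u\n", len);
    }
    catch (const boost::numeric::bad_numeric_cast& e)
    {
        std::printf("screenshot too large for this API: %s\n", e.what());
        // report the error / shut down cleanly instead of truncating silently
    }
}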
The solution depends on context. In some cases, you know where the data comes from, and can exclude overflow: an int that is initialized with 0 and incremented once a second, for example, isn't going to overflow anytime in the lifetime of the machine. In such cases, you just convert (or allow the implicit conversion to do its stuff), and don't worry about it.
Another type of case is fairly similar: in your case, for example, it's probably not reasonable for a screenshot to have more data than can be represented by an int, so the conversion is also safe. Provided the data really did come from a screenshot; in such cases, the usual procedure is to validate the data on input, ensuring that it fulfills your constraints downstream, and then do no further validation.
Finally, if you have no real control over where the data is coming from, and can't validate on entry (at least not for your constraints downstream), you're stuck with using some sort of checking conversion, validating immediately at the point of conversion.
If you push a 64-bit overflowing number into a 32-bit library you open Pandora's box -- undefined behaviour.
Throw an exception. Since exceptions can in general spring up arbitrarily anywhere you should have suitable code to catch it anyway. Given that, you may as well exploit it.
Error messages are unpleasant but they're better than undefined behaviour.
Such scenarios can be handled in one of four ways, or using a combination of them:
use the right types
use static assertions
use runtime assertions
ignore it until it hurts
Usually the best approach is to use the right types right up until your code gets ugly, and then roll in static assertions. Static assertions are much better than runtime assertions for this very purpose.
Finally, when static assertions won't work (like in your example), you use runtime assertions - yes, they get into customers' faces, but at least your program behaves predictably. Yes, customers don't like assertions - they panic ("we have an error!" in all caps) - but without an assertion the program would likely misbehave, and there would be no easy way to diagnose the problem.
One thing just came to my mind: since I need some sort of runtime check (whether or not the value of e.g. buf.size() exceeds the range of unsigned int can only be tested at runtime), but I do not want to have a million assert() invocations everywhere, I could do something like
template <typename T, typename U>
T integer_cast( U v ) {
    assert( v <= std::numeric_limits<T>::max() );
    return static_cast<T>( v );
}
That way, I would at least have the assertion centralized, and
return createFromData( &buf[0], integer_cast<unsigned int>( buf.size() ) );
is a tiny bit better. Maybe I should rather throw an exception (it is quite exceptional indeed!) instead of assert'ing, to give the caller a chance to handle the situation gracefully by rolling back previous work and issuing diagnostic output or the like.
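A throwing variant of that helper might look like the sketch below (the name checked_integer_cast and the choice of std::overflow_error are mine; it only covers the narrowing-from-wider-unsigned case from the question):
#include <limits>
#include <stdexcept>
#include <type_traits>

template <typename T, typename U>
T checked_integer_cast( U v )
{
    static_assert(std::is_unsigned<U>::value && sizeof(U) >= sizeof(T),
                  "sketch only covers narrowing from a wider unsigned type");
    if ( v > static_cast<U>(std::numeric_limits<T>::max()) )
        throw std::overflow_error("value does not fit in the destination integer type");
    return static_cast<T>( v );
}

// return createFromData( &buf[0], checked_integer_cast<unsigned int>( buf.size() ) );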

Coding conventions for method returns in C++

I have observed that the general coding convention for signalling successful completion of a method's intended functionality is to return 0 (as in exit(0)).
This kind of confuses me because, if I have a method in my if statement and the method returns 0, the "if condition" is false, which makes me think for a moment that the method failed. Of course I know I have to prefix it with a "!" (as in if(!Method())), but isn't this convention kind of contradicting itself?
You need to differentiate between an error code and an error flag. A code is a number representing any number of errors, while a flag is a boolean that indicates success.
When it comes to error codes, the idea is: there is only one way to succeed, but there are many ways to fail. Take 0 as a good single unique number representing success; then every other number is a way of indicating failure. (It doesn't make sense any other way.)
When it comes to error flags, the idea is simple: True means it worked, false means it didn't. You can usually then get the error code by some other means, and act accordingly.
Some functions use error codes, some use error flags. There's nothing confusing or backwards about it, unless you're trying to treat everything as a flag. Not all return values are a flag, that's just something you'll have to get used to.
Keep in mind in C++ you generally handle errors with exceptions. Instead of looking up an error code, you just get the necessary information out of the caught exception.
The convention isn't contradicting itself, it's contradicting your desired use of the function.
One of them has to change, and it's not going to be the convention ;-)
I would usually write either:
if (Function() == 0) {
    // code for success
} else {
    // code for failure
}
or if (Function() != 0) with the cases the other way around.
Integers can be implicitly converted to boolean, but that doesn't mean you always should. 0 here means 0, it doesn't mean false. If you really want to, you could write:
int error = Function();
if (!error) {
    // code for success
} else {
    // code for failure, perhaps using error value
}
There are various conventions, but the most common for C functions is to return 0 on failure and a positive value on success, so you can just use the result inside an if statement (almost all CPUs have conditional jumps which can test whether a value is 0 or not, which is why in C this is "abused", with 0 meaning false and everything else meaning true).
Another convention is to return -1 on error and some other value on success instead (you especially see this with POSIX functions that set the errno variable). And this is where 0 can be interpreted as "success".
Then there's exit. It is different, because the value returned is not to be interpreted by C, but by a shell. And here the value 0 means success, and every other value means an error condition (a lot of tools tell you what type of error occurred with this value). This is because in the shell you normally only have a range of 0-127 for returning meaningful values (historic reasons: it's an unsigned byte, and everything above 127 means killed by some signal, IIRC).
You have tagged your question as [c] and [c++]. Which is it? Because the answer will differ somewhat.
You said:
I have observed that the general coding convention for a successful completion of a method intended functionality is 0.
For C, this is fine.
For C++, it decidedly is not. C++ has another mechanism to signal failure (namely exceptions). Abusing (numeric) return values for that is usually a sign of bad design.
If exceptions are a no-go for some reason, there are other ways to signal failure without clogging the method's return type. Alternatives include returning a bool (consider a method try_insert) or using an invalid/reserved return value for failure, such as string::npos, which is used by the string::find method when no occurrence is found in the string.
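Sketches of those two alternatives (the function names are invented for illustration):
#include <cstddef>
#include <string>

// 1) An error flag: true means it worked, false means it didn't.
bool try_insert(std::string& s, std::size_t pos, const std::string& what)
{
    if (pos > s.size())
        return false;          // failure, no error code needed
    s.insert(pos, what);
    return true;
}

// 2) A reserved "invalid" value: std::string::npos plays the not-found role.
bool contains_needle(const std::string& s)
{
    return s.find("needle") != std::string::npos;
}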
exit(0) is a very special case because it's a sentinel value requesting that the runtime tell the operating system to return whatever the real success value is on that OS. It's entirely possible that it won't be the number 0.
As you say, many functions return 0 for success, but they're mainly "legacy" C library OS-interfacing functions, following the interfacing style of the operating systems on which C was first developed and deployed.
In C++, 0 may therefore be a success value when wrapping such a C legacy interface. Another situation where you might consider using 0 for success is when you are effectively returning an error code, such that all errors are non-zero values and 0 makes sense as a not-an-error value. So, don't think of the return value as a boolean (even though C++ will implicitly convert it to one), but as an error code where 0 means "no error". (In practice, using an enum is typically best).
Still, you should generally return a boolean value from functions that are predicates of some form, such as is_empty(), has_dependencies(), can_fit() etc., and typically throw an exception on error. Alternatively, use a sub-system (and perhaps thread) specific value for error codes as per libc's errno, or accept a separate reference/pointer argument to the variable to be loaded with the error code.
int retCode = SaveWork();
if(retCode == 0) {
    // Success!
} else if(retCode == ERR_PERMISSIONS) {
    // User doesn't have permissions, inform him
    // and let him choose another place
} else if(retCode == ERR_NO_SPACE) {
    // No space left to save the work. Figure out something.
} else {
    // I give up, the user is screwed.
}
So, if 0/false were returned to mean failure, you could not distinguish what was the cause of the error.
For C++, you could use exceptions to distinguish between different errors. You could also use a global variable, akin to errno, which you inspect in case of a failure. When neither exceptions nor global variables are desired, returning an error code is commonly used.
As the other answers have already said, using 0 for success leaves everything non-zero for failure. Often (though not always) this means that individual non-zero values are used to indicate the type of failure. So as has already been said, in that situation you have to think of it as an error code instead of an success/failure flag.
And in that kind of situation, I absolutely loathe seeing a statement like if(!Method()) which is actually a test for success. For myself too, I've found it can cause a moment of wondering whether that statement is testing for success or failure. In my opinion, since it's not a simple boolean return, it shouldn't be written like one.
Ideally, since the return is being used as an error code the author of that function should have provided an enum (or at least a set of defines) that can be used in place of the raw numbers. If so, then I would always prefer the test rewritten as something like if(Method() == METHOD_SUCCESS). If the author didn't provide that but you have documentation on the error code values, then consider making your own enum or defines and using that.
If nothing else, I would write it as if(Method() == 0) because then at least it should still be clear to the reader that Method doesn't return a simple boolean and that zero has a specific meaning.
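For example, a small sketch of the enum-based spelling (all names invented for illustration):
enum MethodResult
{
    METHOD_SUCCESS = 0,        // 0 kept for success, matching the convention above
    METHOD_ERR_NOT_FOUND,
    METHOD_ERR_PERMISSIONS,
    METHOD_ERR_IO
};

MethodResult Method();         // hypothetical declaration

void caller()
{
    if (Method() == METHOD_SUCCESS)
    {
        // reads unambiguously as a test for success, not as a boolean
    }
    // else: compare against the specific METHOD_ERR_* values
}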
While that convention is often used, it certainly isn't the only one. I've seen conventions that distinguish success and failure by using positive and negative values. This particularly happens when a function returns a count upon success, but needs to return something that isn't a valid count upon failure (often -1). I've also seen variations using unsigned numbers where (unsigned int)-1 (aka 0xffffffff) represented an error. And there are probably others that I can't even think of offhand.
Since there's no single right way to do it, various authors at various times have invented various schemes.
And of course this is all without mentioning that exceptions offer (among other things) a totally different way to provide information when a function has an error.
I use this convention in my code:
int r = Count();
if(r >= 0) {
    // Function successful, r contains a useful non-negative value
}
else {
    // A negative r represents an error code
}
I tend to avoid bool return values, since they have no advantages (not even regarding their size, which is rounded up to a byte) and they limit possible extensions of the function's return values.
Before using exceptions, take into account the performance and memory issues they bring.

Why do we follow opposite conventions while returning from main()?

I have gone through this and this,
but the question I am asking here is: why is 0 considered a success?
We always associate 0 with false, don't we?
Because there are more fail cases than success cases.
Usually, there is only one reason we succeed (because we're successful :)), but there are a whole lot of reasons why we could fail. So 0 means success, and everything else means failure, and the value could be used to report the reason.
For functions in your code, this is different, because you are the one specifying the interface, and thus can just use a bool if it suffices. For main, there is one fixed interface for returns, and there may be programs that just report succeed/fail, but others that need more fine error reporting. To satisfy them all, we will have multiple error cases.
I have to quibble with Johannes' answer a bit. True, 0 is used for success because there is only 1 successful outcome while there can be many unsuccessful outcomes. But my experience is that return codes have less to do with reasons for failure than with levels of failure.
Back in the days of batch programming there were usually conventions for return codes that allowed for some automation of the overall stream of execution. So a return code of 4 might be a warning but the next job could continue; an 8 might mean the job stream should stop; a 12 might mean something catastrophic happened and the fire department should be notified.
Similarly, batches would set aside some range of return codes so that the overall batch stream could branch. If an update program returned XX, for instance, the batch might skip a backup step because nothing changed.
Return codes as reasons for failure aren't all that helpful, certainly not nearly as much as log files, core dumps, console alerts, and whatnot. I have never seen a system that returns XX because "such and such file was not found", for instance.
Generally the return values for any given program tend to be a list (enum) of possible values, such as Success or specific errors. As a "list", this list generally tends to start at 0 and count upwards. (As an aside, this is partly why the Microsoft Error Code 0 is ERROR_SUCCESS).
Globally speaking, Success tends to be one of the only return values that almost any program should be capable of returning. Even if a program has several different error values, Success tends to be a shared necessity, and as such is given the most common position in a list of return values.
It's just the simplest way to allow a most common return value by default. It's completely separate from the idea of a boolean.
Here's the convention that I'm used to from various companies (although this obviously varies from place to place):
A negative number indicates an error occurred. The value of the negative number indicates (hopefully) the type of error.
A zero indicates success (generic success)
A positive number indicates a type of success (in some business cases there are various things that trigger a successful case; this indicates which successful case happened).
I, too, found this confusing when I first started programming. I resolved it in my mind by saying, 0 means no problems.