Range analysis of floating point values? - c++

I have an image processing program which uses floating point calculations. However, I need to port it to a processor which does not have floating point support in it. So, I have to change the program to use fixed point calculations. For that I need proper scaling of those floating point numbers, for which I need to know the range of all values, including intermediate values of the floating point calculations.
Is there a method where I can just run the program and it automatically gives me the range of all the floating point calculations in the program? Trying to figure out the ranges manually would be too cumbersome, so if there is some tool for doing it, that would be awesome!

You could use some "measuring" replacement for your floating type, along these lines (live example):
#include <cmath>
#include <iostream>
#include <limits>

template<typename T>
class foo
{
    T val;
    using lim = std::numeric_limits<int>;
    static int& min_val() { static int e = lim::max(); return e; }
    static int& max_val() { static int e = lim::min(); return e; }
    static void sync_min(T e) { if (e < min_val()) min_val() = int(e); }
    static void sync_max(T e) { if (e > max_val()) max_val() = int(e); }
    static void sync(T v)
    {
        v = std::abs(v);
        T e = v == 0 ? T(1) : std::log10(v);
        sync_min(std::floor(e)); sync_max(std::ceil(e));
    }
public:
    foo(T v = T()) : val(v) { sync(v); }
    foo& operator=(T v) { val = v; sync(v); return *this; }
    template<typename U> foo(U v) : foo(T(v)) {}
    template<typename U> foo& operator=(U v) { return *this = T(v); }
    operator T&() { return val; }
    operator const T&() const { return val; }
    static int min() { return min_val(); }
    static int max() { return max_val(); }
};
to be used like
int main()
{
    using F = foo<float>;
    F x;
    for (F e = -10.2; e <= 30.4; e += .2)
        x = std::pow(10, e);
    std::cout << F::min() << " " << F::max() << std::endl; // -11 31
}
This means you need to define an alias (say, Float) for your floating type (float or double) and use it consistently throughout your program. This may be inconvenient but it may prove beneficial eventually (because then your program is more generic). If your code is already templated on the floating type, even better.
After this parametrization, you can switch your program to "measuring" or "release" mode by defining Float to be either foo<T> or T, where T is your float or double.
The good thing is that you don't need external tools, your own code carries out the measurements. The bad thing is that, as currently designed, it won't catch all intermediate results. You would have to define all (e.g. arithmetic) operators on foo for this. This can be done but needs some more work.
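As a sketch of that switch (the MEASURE_RANGES macro name is my own, not from the answer above), the alias can be chosen at compile time, and intermediate results can be caught by also routing the arithmetic operators through foo:

#ifdef MEASURE_RANGES
using Float = foo<float>;   // "measuring" build: every assignment updates the recorded range
#else
using Float = float;        // "release" build: plain float, zero overhead
#endif

// One example of the extra operators needed to catch intermediate results:
template<typename T>
foo<T> operator*(foo<T> a, foo<T> b) { return foo<T>(T(a) * T(b)); }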

It is not true that you cannot use floating-point code on hardware that does not support floating point - the compiler will provide software routines to perform the floating-point operations. They may be rather slow, but if they are fast enough for your application, that is the path of least resistance.
It is probably simplest to implement a fixed point data type class and have its member functions detect over/underflow as a debug option (because the checking will otherwise slow your code).
I suggest you look at Anthony Williams' fixed-point math C++ library. It is in C++ and defines a fixed class with extensive function and operator overloading, so it can largely be used simply by replacing float or double in your existing code with fixed. It uses int64_t as the underlying integer data type, with 34 integer bits and 28 fractional bits (34Q28), so is good for about 8 decimal places and a wider range than int32_t.
It does not have the under/overflow checking I suggested, but it is a good starting point for you to add your own.
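To give an idea of what adding that check might look like, here is a minimal sketch of a debug-only overflow test in a fixed-point addition; the 16.16 format, the type name and the FIXED_CHECKED macro are my own illustrative choices, not part of the library:

#include <cassert>
#include <cstdint>

struct fixed16_16 {
    int32_t raw; // value scaled by 2^16

    friend fixed16_16 operator+(fixed16_16 a, fixed16_16 b) {
        int64_t wide = int64_t(a.raw) + int64_t(b.raw); // do the sum in a wider type
#ifdef FIXED_CHECKED
        assert(wide >= INT32_MIN && wide <= INT32_MAX && "fixed-point overflow");
#endif
        return fixed16_16{ static_cast<int32_t>(wide) };
    }
};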
On 32-bit ARM this library performs about 5 times faster than software floating point and is comparable in performance to ARM's VFP unit for C code.
Note that the sqrt() function in this library has poor precision for very small values, as it loses lower-order bits in intermediate calculations that could be preserved. It can be improved by replacing it with the version I presented in this question.

For self-contained C programs, you can use Frama-C's value analysis to obtain ranges for the floating-point variables in the program - for instance, a variable h constrained by the inputs, and a variable g computed from h.
There is a specification language to describe the ranges of the inputs (information without which it is difficult to say anything informative). Here, I used that language to specify what the function float_interval was expected to do:
/*@ ensures \is_finite(\result) && l <= \result <= u ; */
float float_interval(float l, float u);
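For instance, a self-contained harness built around that stub might look like the following. This is a sketch of my own (only the float_interval contract above comes from the answer), and the exact command line varies between Frama-C versions: older releases use frama-c -val, newer ones frama-c -eva.

/*@ ensures \is_finite(\result) && l <= \result <= u ; */
float float_interval(float l, float u);

int main(void)
{
    float h = float_interval(1.0f, 10.0f); /* input known to lie in [1, 10] */
    float g = h * h + 1.0f;                /* the value analysis then bounds g as well */
    return 0;
}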
Frama-C is easiest to install on Linux, with Debian and Ubuntu binary packages for a recent (but usually not the latest) version available from within the distribution.
If you could post your code, it would help in telling whether this approach is realistic. If your code is C++, for instance (your question does not say; it is tagged with several language tags), then the current version of Frama-C will be no help, as it only accepts C programs.

Related

Subtract extremely small number from one in C++

I need to subtract an extremely small double number x from 1, i.e. to calculate 1-x in C++ for 0<x<1e-16. Because of machine precision restrictions, for small enough x I will always get 1-x=1. A simple solution is to switch from double to some more precise format like long. But because of some restrictions I can't switch to more precise number formats.
What is the most efficient way to get accurate value of 1-x where x is an extremely small double if I can't use more precise formats and I need to store the result of the subtraction as a double? In practice I would like to avoid percentage errors greater then 1% (between double representation of 1-x and its actual value).
P.S. I am using Rcpp to calculate the quantiles of the standard normal distribution via the qnorm function. This function is symmetric around 0.5 and much more accurate for values close to 0. Therefore instead of qnorm(1-(1e-30)) I would like to calculate -qnorm(1e-30), but to derive 1e-30 from 1-(1e-30) I need to deal with a precision problem. The restriction to double is due to the fact that, as far as I know, it is not safe to use more precise numeric formats in Rcpp. Note that my inputs to qnorm are sort of exogenous, so I can't derive 1-x from x during some preliminary calculations.
A simple solution is to switch from double to some more precise format like long [presumably, long double]
In that case you have no solution. long double is an alias for double on all modern machines - I stand corrected: gcc and icc still support it; only cl dropped support for it a long time ago.
So you have two solutions, and they're not mutually exclusive:
Use an arbitrary precision library instead of the built-in types. They're orders of magnitude slower, but if that's the best your algorithm can work with then that's that.
Use a better algorithm, or at least rearrange your equation variables, to not have this need in the first place. Use distribution and cancellation rules to avoid the problem entirely. Without a more in depth description of your problem we can't help you, but I can tell you with certainty that double is more than enough to allow us to model airplane AI and flight parameters anywhere in the world.
Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.
Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD student at the time, I believe).
The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-() and a test case.
#include <iostream>
class ddreal {
private:
    static inline double Plus2(double a, double b, double& err) {
        double s = a + b;
        double bb = s - a;
        err = (a - (s - bb)) + (b - bb);
        return s;
    }
    static inline void Plus3(double& a, double& b, double& c) {
        double t3, t2, t1 = Plus2(a, b, t2);
        a = Plus2(c, t1, t3);
        b = Plus2(t2, t3, c);
    }
public:
    double x[2];
    ddreal() { x[0] = x[1] = 0.0; }
    ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
    ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }
    ddreal& operator -= (ddreal const& b) {
        double t1, t2, s2;
        x[0] = Plus2(x[0], -b.x[0], s2);
        t1 = Plus2(x[1], -b.x[1], t2);
        x[1] = Plus2(s2, t1, t1);
        t1 += t2;
        Plus3(x[0], x[1], t1);
        return *this;
    }
    inline double toDouble() const { return x[0] + x[1]; }
};
inline ddreal operator-(ddreal const& a, ddreal const& b)
{
    ddreal retval = a;
    return retval -= b;
}
int main()
{
    double sdone{ 1.0 };
    double sdwee{ 1.0e-42 };
    double sdval = sdone - sdwee;
    double sdans = sdone - sdval;
    std::cout << sdans << "\n"; // Gives zero, as expected
    ddreal ddone{ 1.0 };
    ddreal ddwee{ 1.0e-42 };
    ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e-42 ...
    ddreal ddans = ddone - ddval;
    std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42
    ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
    ddreal ddsec = ddone - ddalt;
    std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42
    return 0;
}
Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)
I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double type fails completely.
Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise (or /fp:strict) flag (with MSVC), or the equivalent(s) for other compilers; using /fp:fast will break the code completely.

Efficient division of an int by intmax

I have an integer of type uint32_t and would like to divide it by the maximum value of uint32_t and obtain the result as a float (in the range 0..1).
Naturally, I can do the following:
float result = static_cast<float>(static_cast<double>(value) / static_cast<double>(std::numeric_limits<uint32_t>::max()));
This is however quite a lot of conversions on the way, and the division itself may be expensive.
Is there a way to achieve the above operation faster, without division and excess type conversions? Or maybe I shouldn't worry because modern compilers are able to generate efficient code already?
Edit: division by MAX+1, effectively giving me a float in range [0..1) would be fine too.
A bit more context:
I use the above transformation in a time-critical loop, with uint32_t being produced from a relatively fast random-number generator (such as pcg). I expect that the conversions/divisions from the above transformation may have some noticeable, albeit not overwhelming, negative impact on the performance of my code.
This sounds like a job for:
std::uniform_real_distribution<float> dist(0.f, 1.f);
I would trust that to give you an unbiased conversion to float in the range [0, 1) as efficiently as possible. If you want the range to be [0, 1] you could use this:
std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f));
Here's an example with two instances of a not-so-random number generator that generates min and max for uint32_t:
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>

struct ui32gen {
    constexpr ui32gen(uint32_t x) : value(x) {}
    uint32_t operator()() { return value; }
    static constexpr uint32_t min() { return 0; }
    static constexpr uint32_t max() { return std::numeric_limits<uint32_t>::max(); }
    uint32_t value;
};

int main() {
    ui32gen min(ui32gen::min());
    ui32gen max(ui32gen::max());
    std::uniform_real_distribution<float> dist(0.f, 1.f);
    std::cout << dist(min) << "\n";
    std::cout << dist(max) << "\n";
}
Output:
0
1
Is there a way to achieve the operation faster, without division
and excess type conversions?
If you want to manually do something similar to what uniform_real_distribution does (but much faster, and slightly biased towards lower values), you can define a function like this:
// [0, 1) the common range
inline float zero_to_one_exclusive(uint32_t value) {
    static const float f_mul =
        std::nextafter(1.f / float(std::numeric_limits<uint32_t>::max()), 0.f);
    return float(value) * f_mul;
}
It uses multiplication instead of division since that often is a bit faster (than your original suggestion) and only has one type conversion. Here's a comparison of division vs. multiplication.
If you really want the range to be [0, 1], you can do as below, which will also be slightly biased towards lower values compared to what std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f)) would produce:
// [0, 1] the not so common range
inline float zero_to_one_inclusive(uint32_t value) {
    static const float f_mul = 1.f / float(std::numeric_limits<uint32_t>::max());
    return float(value) * f_mul;
}
Here's a benchmark comparing uniform_real_distribution to zero_to_one_exclusive and zero_to_one_inclusive.
Two of the casts are superfluous. You don't need to cast to float when you assign to a float anyway. Also it is sufficient to cast one of the operands to avoid integer arithmetic. So we are left with
float result = static_cast<double>(value) / std::numeric_limits<uint32_t>::max();
This last cast you cannot avoid (otherwise you would get integer arithmetic).
Or maybe I shouldn't worry because modern compilers are able to
generate efficient code already?
Definitely a yes and no! Yes, trust the compiler that it knows best how to optimize code and write for readability first. And no, don't blindly trust. Look at the output of the compiler. Compare different versions and measure them.
Is there a way to achieve the above operation faster, without division
[...] ?
Probably yes. Dividing by std::numeric_limits<uint32_t>::max() is so special that I wouldn't be too surprised if the compiler came up with some tricks. My first approach would again be to look at the output of the compiler and maybe compare the output of different compilers. Only if the compiler's output turns out to be suboptimal would I bother to enter some manual bit-fiddling.
For further reading this might be of interest: How expensive is it to convert between int and double? TL;DR: it actually depends on the hardware.
If performance were a real concern I think I'd be inclined to represent this 'integer that is really a fraction' in its own class and perform any conversion only where necessary.
For example:
#include <iostream>
#include <cstdint>
#include <limits>
struct fraction
{
    using value_type = std::uint32_t;

    constexpr explicit fraction(value_type num = 0) : numerator_(num) {}

    static constexpr auto denominator() -> value_type { return std::numeric_limits<value_type>::max(); }
    constexpr auto numerator() const -> value_type { return numerator_; }

    constexpr auto as_double() const -> double {
        return double(numerator()) / denominator();
    }
    constexpr auto as_float() const -> float {
        return float(as_double());
    }

private:
    value_type numerator_;
};

auto generate() -> std::uint32_t; // assumed to be defined elsewhere (e.g. the random-number generator from the question)

int main()
{
    auto frac = fraction(generate());
    // use/manipulate/display frac here ...
    // ... and finally convert to double/float if necessary
    std::cout << frac.as_double() << std::endl;
}
However if you look at code gen on godbolt you'll see that the CPU's floating point instructions take care of the conversion. I'd be inclined to measure performance before you run the risk of wasting time on early optimisation.
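For that measurement, a minimal timing harness along these lines can help (a sketch of mine; the iteration count is arbitrary, and the volatile sink only serves to keep the loop from being optimised away):

#include <chrono>
#include <cstdint>
#include <iostream>

int main()
{
    volatile float sink = 0.f;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < 100000000u; ++i)
        sink = static_cast<float>(i) / 4294967295.0f; // candidate conversion under test
    const auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(stop - start).count() << " s\n";
    (void)sink;
}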

preferred mechanism to attach a type to a scalar?

[ edit: changed meters/yards to foo/bar; this isn't about converting meters to yards. ]
What's the best way to attach a type to a scalar such as a double? The typical use-case is units-of-measure (but I'm not looking for an actual implementation, boost has one).
This would appear to be as simple as:
template <typename T>
struct Double final
{
    typedef T type;
    double value;
};

namespace tags
{
    struct foo final {};
    struct bar final {};
}

constexpr double FOOS_TO_BARS_ = 3.141592654;

inline Double<tags::bar> to_bars(const Double<tags::foo>& foos)
{
    return Double<tags::bar> { foos.value * FOOS_TO_BARS_ };
}

static void test(double value)
{
    using namespace tags;
    const Double<foo> value_in_foos{ value };
    const Double<bar> value_in_bars = to_bars(value_in_foos);
}
Is that really the case? Or are there hidden complexities or other important considerations to this approach?
This would seem far, far superior to
inline double foos_to_bars(double foos)
{
    return foos * FOOS_TO_BARS_;
}
while adding hardly any complexity or overhead.
I'd go with a ratio-based approach, much like std::chrono. (Howard Hinnant shows it in his CppCon 2016 talk about <chrono>.)
#include <ratio>

template<typename Ratio = std::ratio<1>, typename T = double>
struct Distance
{
    using ratio = Ratio;
    T value;
};

template<typename To, typename From>
To distance_cast(From f)
{
    using r = std::ratio_divide<typename To::ratio, typename From::ratio>;
    return To{ f.value * r::den / r::num };
}
using yard = Distance<std::ratio<9144, 10000>>;
using meter = Distance<>;
using kilometer = Distance<std::kilo>;
using foot = Distance<std::ratio<3048,10000>>;
demo
This is a naive implementation and probably could be improved a lot (at the very least by allowing implicit casts where they're safe), but it's a proof of concept and it's trivially extensible.
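For example, converting between the aliases above might look like this (a usage sketch I've added; the numeric comments assume the ratios given):

int main()
{
    meter m{ 1500.0 };
    kilometer km = distance_cast<kilometer>(m); // km.value == 1.5
    yard y = distance_cast<yard>(m);            // y.value is roughly 1640.42
}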
Pros:
meter m = yard{10} is either a compile time error or a safe implicit conversion,
pretty type names
you'd have to work against the solution very hard to make an invalid conversion
simple to use
Cons:
Possible integer overflows/precision problems (may be alleviated by quality of implementation?)
may be non-trivial to implement correctly
Firstly, yes, I think the way you have suggested is quite reasonable, though whether it is to be preferred would depend on the context.
Your approach has the advantage that you can define conversions that might not just be simple multiplications (for example, Celsius and Fahrenheit).
Your method does, however, create different types, which leads to a need to create conversions; this can be good or bad depending on the use.
(I appreciate that your yards and metres were just an example; I'll use it just as an example too.)
If I'm writing code that deals with lengths, (most of) the logic is going to be the same whatever the units. Whilst I could make the function that contains that logic a template so it can take different units, there's still a reasonable use case where data is needed from two different sources and is supplied in two different units. In this situation I'd rather be dealing with one Length class than with a class per unit; these lengths could either hold their conversion information, or just use one fixed unit, with conversion being done at the input/output stages.
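A minimal sketch of that single-class approach (my own illustration, with metres as the fixed internal unit; the names are not from the question):

class Length
{
    double metres_;
public:
    constexpr explicit Length(double metres) : metres_(metres) {}
    static constexpr Length from_yards(double yards) { return Length(yards * 0.9144); }
    constexpr double as_metres() const { return metres_; }
    constexpr double as_yards() const { return metres_ / 0.9144; }
};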
On the other hand, when we have different types for different measurements, e.g. length, area, temperature, not having default conversions between these types is a good thing. And it's good that I can't accidentally add a length to a temperature.
(Of course multiplication of types is different.)
In my opinion, your approach is over-designed to the point that bugs have crept in that are hard to spot. Even at this point the syntactic complexity you have introduced has allowed your conversion to become inaccurate: you are off from the 8th significant figure onwards.
The standard conversion is that 1 inch is 25.4 mm, which means that one yard is exactly 0.9144 m.
Neither this nor its reciprocal can be represented exactly in IEEE754 binary floating point.
If I were you I'd define
constexpr double METERS_IN_YARDS = 0.9144;
constexpr double YARDS_IN_METERS = 1.0 / 0.9144;
to keep the bugs away, and work in double precision floating point arithmetic the old-fashioned way.
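A usage sketch for those constants, keeping all arithmetic in plain double (the function names are mine):

constexpr double yards_to_meters(double yards) { return yards * METERS_IN_YARDS; }
constexpr double meters_to_yards(double meters) { return meters * YARDS_IN_METERS; }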

Float comparisons failing without any obvious reason (32-bit X86 on Linux)

I have stumbled upon an interesting case of comparing (==, !=) float types.
I encountered this problem while porting my own software from windows to linux. It's a bit of a bummer. The relevant code is the following:
template<class T> class PCMVector2 {
public:
    T x, y;
public:
    bool operator == ( const PCMVector2<T>& a ) const {
        return x == a.x && y == a.y;
    }
    bool operator != ( const PCMVector2<T>& a ) const {
        return x != a.x || y != a.y;
    }
    // Mutable normalization
    PCMVector2<T>& Normalize() {
        const T l = 1.0f / Length();
        x *= l;
        y *= l;
        return *this;
    }
    // Immutable normalization
    const PCMVector2<T> Normalized() {
        const T l = 1.0f / Length();
        return PCMVector2<T>(x*l, y*l);
    }
    // Vector length
    T Length() const { return sqrt(x*x + y*y); }
};
I cleverly designed unit test functions which check all available functionality regarding those classes, before porting to linux. And, in contrast to msvc, g++ doesn't complain, but gives incorrect results at runtime.
I was stumped, so I did some additional logging, type-puns, memcmp's, etc., and they all showed that the memory is 1:1 the same! Does anyone have any ideas about this?
My flags are: -Wall -O2 -j2
Thanks in advance.
EDIT2: The failed test case is:
vec2f v1 = vec2f(2.0f,3.0f);
v1.Normalize(); // mutable normalization
if( v1 != vec2f(2.0f,3.0f).Normalized() ) //immutable normalization
// report failure
Note: Both normalizations are the same, and yield the same results (according to memcmp).
RESOLUTION: It turns out that you should never trust the compiler about floating-point numbers, no matter how sure you are about the memory you compare! Once data goes to the registers, it can change, and you have no control over it. After some digging regarding registers, I found this neat source of information. Hope it's useful to someone in the future.
Floating point CPU registers can be larger than the floating point type you're working with. This is especially true with float which is typically only 32 bits. A calculation will be computed using all the bits, then the result will be rounded to the nearest representable value before being stored in memory.
Depending on inlining and compiler optimization flags, it is possible that the generated code may compare one value from memory with another one from a register. Those may compare as unequal, even though their representation in memory will be bit-for-bit identical.
This is only one of the many reasons why comparing floating-point values for equality is not recommended. Especially when, as in your case, it appears to work some of the time.
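If bit-exact equality cannot be relied on, a tolerance-based comparison is the usual workaround. A minimal sketch (the relative tolerance is an arbitrary choice and should be tuned to the data being compared):

#include <algorithm>
#include <cmath>

template<class T>
bool nearly_equal(T a, T b, T rel_tol = T(1e-6))
{
    // true when a and b differ by at most rel_tol relative to their magnitude
    return std::abs(a - b) <= rel_tol * std::max(std::abs(a), std::abs(b));
}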

Converting floating point to fixed point

In C++, what's the generic way to convert any floating point value (float) to fixed point (int, 16:16 or 24:8)?
EDIT: For clarification, fixed-point values have two parts to them: an integer part and a fractional part. The integer part can be represented by a signed or unsigned integer data type. The fractional part is represented by an unsigned integer data type.
Let's make an analogy with money for the sake of clarity. The fractional part may represent cents - a fractional part of a dollar. The range of the 'cents' data type would be 0 to 99. If an 8-bit unsigned integer were to be used for fixed-point math, then the fractional part would be split into 256 evenly spaced parts.
I hope that clears things up.
Here you go:
#include <limits>

// A signed fixed-point 16:16 class
class FixedPoint_16_16
{
    short intPart;
    unsigned short fracPart;
public:
    FixedPoint_16_16(double d)
    {
        *this = d; // calls operator=
    }
    FixedPoint_16_16& operator=(double d)
    {
        intPart = static_cast<short>(d);
        fracPart = static_cast<unsigned short>(
            (std::numeric_limits<unsigned short>::max() + 1.0) * (d - intPart));
        return *this;
    }
    // Other operators can be defined here
};
EDIT: Here's a more general class based on another common way to deal with fixed-point numbers (and which KPexEA pointed out):
#include <cstddef>

template <class BaseType, std::size_t FracDigits>
class fixed_point
{
    const static BaseType factor = 1 << FracDigits;
    BaseType data;
public:
    fixed_point(double d = 0.0)
    {
        *this = d; // calls operator=
    }
    fixed_point& operator=(double d)
    {
        data = static_cast<BaseType>(d * factor);
        return *this;
    }
    BaseType raw_data() const
    {
        return data;
    }
    // Other operators can be defined here
};

fixed_point<int, 8> fp1;           // Will be signed 24:8 (if int is 32 bits)
fixed_point<unsigned int, 16> fp2; // Will be unsigned 16:16 (if int is 32 bits)
A cast from float to integer will throw away the fractional portion, so if you want to keep that fraction around as fixed point then you just multiply the float before casting it. Mind you, the code below does not check for overflow.
If you want 16:16
double f = 1.2345;
int n;
n=(int)(f*65536);
if you want 24:8
double f = 1.2345;
int n;
n=(int)(f*256);
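Either snippet can overflow silently if f is too large for the target format. A sketch of the missing range check for the 16:16 case, assuming a 32-bit int (the function name is mine):

#include <cstdint>
#include <stdexcept>

int32_t to_fixed_16_16(double f)
{
    const double scaled = f * 65536.0; // shift the value up by 16 fractional bits
    if (scaled >= 2147483648.0 || scaled < -2147483648.0)
        throw std::overflow_error("value out of 16:16 range");
    return static_cast<int32_t>(scaled);
}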
Edit: My first comment applies to the version before Kevin's edit, but I'll leave it here for posterity. Answers change so quickly here sometimes!
The problem with Kevin's approach is that with fixed point you are normally packing into a guaranteed word size (typically 32 bits). Declaring the two parts separately leaves you at the whim of your compiler's structure packing. Yes, you could force it, but it does not work for anything other than a 16:16 representation.
KPexEA is closer to the mark by packing everything into an int - although I would use "signed long" to try to be explicit about 32 bits. Then you can use his approach for generating the fixed-point value, and bit slicing to extract the component parts again. His suggestion also covers the 24:8 case.
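As a sketch of that bit slicing (my own example, for a 16:16 value packed into a 32-bit word):

#include <cstdint>

void split_16_16(int32_t fixed, int32_t& int_part, uint32_t& frac_part)
{
    int_part  = fixed >> 16;                             // high 16 bits: integer part
    frac_part = static_cast<uint32_t>(fixed) & 0xFFFFu;  // low 16 bits: fractional part
}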
( And everyone else who suggested just static_cast.....what were you thinking? ;) )
I gave the answer to the guy that wrote the best answer, but I really used a related question's code that points here.
It used templates and made it easy to ditch the dependencies on the boost lib.
This is fine for converting from floating point to integer, but the O.P. also wanted fixed point.
Now how you'd do that in C++, I don't know (C++ not being something I can think in readily). Perhaps try a scaled-integer approach, i.e. use a 32 or 64 bit integer and programmatically allocate the last, say, 6 digits to what's on the right hand side of the decimal point.
There isn't any built-in support in C++ for fixed-point numbers. Your best bet would be to write a wrapper 'FixedInt' class that takes doubles and converts them.
As for a generic method to convert... the int part is easy enough, just grab the integer part of the value and store it in the upper bits... decimal part would be something along the lines of:
for (int i = 1; i <= precision; i++)
{
    if (decimal_part >= 1.f / (float)(1 << i))
    {
        decimal_part -= 1.f / (float)(1 << i);
        fixint_value |= (1 << (precision - i));
    }
}
although this is likely to contain bugs still