Issue in std::max() function comparison with fixed point implementation - C++

Is there any standard function available which can help me compare the max() or min() of two float values?
I have written a fixed point implementation of these min() and max() functions for types q0s32 through q32s0 (33 types).
I want to test the precision loss of my functions against std::min() and std::max(), but the results from the std functions are not good.
I tried the following, but it did not work for me, as the result is not as per my expectation.
Code:
#include <algorithm>
#include <cstdio>

float num1 = 4.5000000054f;
float num2 = 4.5000000057f;
float resf = std::max(num1, num2);
printf("Result is :%20.15f\n", resf);
printf("num1 :%20.15f and num2 :%20.15f\n", num1, num2);
Output:
Result is : 4.500000000000000
num1 : 4.500000000000000 and num2 : 4.500000000000000

Most implementations of C++ use the IEEE 754 standard for floating point arithmetic. Here is some useful information regarding this issue:
In IEEE 754, float is a 32 bit single precision floating point number (1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand), i.e. float has about 7 decimal digits of precision.
In IEEE 754, double is a 64 bit double precision floating point number (1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand), i.e. double has about 15 decimal digits of precision.
You need to use double instead to get the desired results: your two literals differ only around the 10th significant digit, which is beyond float's precision, so both round to the same float value.
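As a minimal sketch, here is the same comparison done in double (assuming IEEE 754; note the literals carry no f suffix, so they stay double precision):
#include <algorithm>
#include <cstdio>

int main()
{
    double num1 = 4.5000000054;   // no 'f' suffix: double literal
    double num2 = 4.5000000057;
    double resd = std::max(num1, num2);
    printf("Result is :%20.15f\n", resd);   // expected: 4.500000005700000
    return 0;
}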


What is the need of suffix 'f' when defining a variable to be float type in C++?

Can there be a difference in bit-representation between a direct assignment of a floating point literal float x = 3.2f; and a double implicitly converted to a float float x2 = 3.2;?
I.e. is
#define EQUAL(FLOAT_LITERAL) \
    (FLOAT_LITERAL##f == static_cast<float>(FLOAT_LITERAL))
EQUAL(3.2) && EQUAL(55.6200093490) // etc ...
true for all floating point literals?
I ask this question because clang and gcc do not complain about narrowing conversions if the numbers are in the value range of float. Warnings are enabled with -Wnarrowing:
float f {3.422222222222222222222222222222246454}; // no warning/error although it should definitely lose precision
float f2 {static_cast<double>(std::numeric_limits<float>::max()) + 1.0}; // no warning/error
float f3 {3.5e38}; // error: narrowing conversion of '3.5e+38' from 'double' to 'float' inside { } [-Wnarrowing]
It is great that the compiler does actual range checks, but is that sufficient?
Assuming IEEE 754, float as 32 bit binary, double as 64 bit binary.
There are decimal fractions that round differently, under IEEE 754 round-to-nearest rules, if converted directly from decimal to float than if they are first converted from decimal to double and then rounded to float.
For example, consider 1.0000000596046447753906250000000000000000000000000001.
1.000000059604644775390625 is exactly representable as a double and is exactly half way between 1.0 and 1.00000011920928955078125, the value of the smallest float greater than 1.0. The longer literal above rounds up to 1.00000011920928955078125 if converted directly, because it is greater than the mid point. If it is first converted to 64 bits, round-to-nearest takes it to the mid point 1.000000059604644775390625, and then round-half-to-even rounds down to 1.0.
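A small sketch of that example (assuming IEEE 754 and a compiler that rounds literals correctly):
float direct     = 1.0000000596046447753906250000000000000000000000000001f;
float via_double = 1.0000000596046447753906250000000000000000000000000001;  // double first, then float
// direct     == 1.00000011920928955078125f  (rounded once, upward past the mid point)
// via_double == 1.0f                        (rounded to the mid point, then half-to-even down)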
The answer given by Patricia is correct. But we generally don't type such numbers, so maybe it's not a problem... unless it happens with some shorter decimal literals?
I illustrated that once in the comments following the answer to Count number of digits after `.` in floating point numbers?
The decimal value 7.038531e-26 is approximately 0x1.5C87FAFFFFFFFCE4F6700...p-21; the nearest double is 0x1.5C87FB0000000p-21 and the nearest float is 0x1.5C87FAp-21.
Note that 0x1.5C87FA0000000p-21 is the nearest double to 7.038530691851209e-26.
So yes, there can be a double-rounding problem (round off error twice in the same direction) with a relatively short literal...
float x = 7.038531e-26f; and float y = 7.038531e-26; should be two different numbers if the compiler rounds the literals correctly.
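A quick sketch of that check (assuming IEEE 754 and correctly rounded literals):
#include <cstdio>

int main()
{
    float x = 7.038531e-26f;   // decimal -> float: one rounding
    float y = 7.038531e-26;    // decimal -> double -> float: two roundings
    printf("%d\n", x == y);    // expected: 0, the two values differ by one float ulp
    return 0;
}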
Is literal double to float conversion equal to float literal?
Usually yes, but not always.
Code that converts to double and then to float (vs. straight to float) potentially incurs double rounding trouble.
The problem is seen with any coded value (even one without a fraction) that is near the half-way case of two adjacent float values.
Occurrence: with random constants and typical float/double, about 1 in 2^30.

When is integer to floating point conversion lossless?

Particularly I'm interested if int32_t is always losslessly converted to double.
Does the following code always return true?
#include <cstdint>

int is_lossless(int32_t i)
{
    double d = i;
    int32_t i2 = d;
    return (i2 == i);
}
What about int64_t?
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t; let us look at precision.
With a common double, with its base 2, sign bit, 53 bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. Such a double has DBL_MANT_DIG == 53, the number of base-2 digits in the floating-point significand.
The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
What about int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, gives an implementation-defined result: one of the 2 nearest candidates. This is often round to nearest, so i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of a double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)  // computes 2^63 exactly as a double
if (d == INT64_MAX_P1) return false; // not lossless
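Putting it together, a sketch of the full int64_t round trip with the UB guard (assuming binary64 double; the function name is illustrative):
#include <cstdint>

bool is_lossless64(int64_t i)
{
    double d = (double)i;                              // may round
    if (d == (INT64_MAX / 2 + 1) * 2.0) return false;  // d is 2^63: converting back would be UB
    return (int64_t)d == i;                            // d now fits in int64_t
}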
Question: Does the following code always return true?
Always is a big statement and therefore the answer is no.
The C++ Standard does not say whether the floating-point types known to C++ (float, double and long double) are of the IEEE 754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specializations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard, [basic.fundamental]
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64: 1 sign bit, 11 exponent bits and 52 fraction bits, decoded for normal values as (-1)^sign x 1.fraction x 2^(exponent - 1023).
However, there is a plethora of other floating-point formats out there that are decoded differently and do not necessarily have the same properties as the well-known IEEE 754. Nonetheless, they are all broadly similar:
They are n bits long
One bit represents the sign
m bits represent the significand, with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know whether or not a double can represent all 32-bit signed integers, you must answer the following questions (assuming our floating-point number is in base 2); a compile-time sketch follows below:
Does my floating-point representation have a hidden first bit in the significand? If so, assume m = m + 1.
A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx x 2^31?
If you can answer yes to the last two questions, then yes, an int32_t can always be represented by the double that is implemented on this particular system.
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.
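As a sketch, those checks can be written with std::numeric_limits at compile time (assuming a base-2 double; digits already counts the hidden bit):
#include <limits>

static_assert(std::numeric_limits<double>::radix == 2,
              "check assumes a base-2 double");
static_assert(std::numeric_limits<double>::digits >= 31,
              "significand holds every int32_t magnitude");
static_assert(std::numeric_limits<double>::max_exponent >= 32,
              "exponent reaches numbers of the form 1.xxxxx * 2^31");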
Note: my answer supposes the double follows IEEE 754, and both int32_t and int64_t are 2's complement.
Does the following code always return true?
The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error; there is no possible precision error (and no possible overflow/underflow, since the exponent covers more than the needed range of values).
What about int64_t?
But the 53 bits of mantissa/significand (including 1 implicit bit) of a double are not enough to hold the 64 bits of an int64_t => an int64_t whose upper and lower set bits are distant enough cannot be stored in a double without precision error (there is still no possible overflow/underflow; the exponent still covers more than the needed range of values).
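A short sketch of the smallest failing magnitude, 2^53 + 1 (assuming binary64 double):
#include <cstdint>
#include <cstdio>

int main()
{
    int64_t big = (INT64_C(1) << 53) + 1;  // needs 54 significand bits
    double d = (double)big;                // rounds to 2^53 exactly
    printf("%d\n", (int64_t)d == big);     // prints 0: the low bit was lost
    return 0;
}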
If your platform uses IEEE 754 for double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE 754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");

What is the maximal integer, still capable of having 1 added to it, that a float type can hold in C++?

Suppose I am using float to hold integer values and adding small shifts to it, approximately 1s or 2s. At which value will the float stop changing? What is the name of this value?
The smallest positive value of an IEEE 754 floating-point variable a where you get a == a+1 is 2^bits_precision, where bits_precision is the number of bits in the significand including the implicit leading bit (one more than the stored significand field) and can be found with std::numeric_limits<T>::digits.
For a 32-bit float, that's 24; for a 64-bit double, that's 53 (again, in the very common context of IEEE 754).
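A small demonstration of the threshold (assuming IEEE 754 float, where digits == 24):
#include <cstdio>

int main()
{
    float a = 16777216.0f;          // 2^24
    printf("%d\n", a + 1.0f == a);  // prints 1: adding 1 changes nothing
    printf("%d\n", a - 1.0f == a);  // prints 0: 16777215.0f is still exact
    return 0;
}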

Losing precision in stringstream

In one of my applications I am trying to put a float value into a string stream like this:
stream << static_cast<float>(double_value);
Instead of getting the entire float value I get only the integer part of it. Any idea why that might happen?
You're casting to a float - which on most platforms is an IEEE 754 32-bit 'single precision' floating point type.
If you look up the format of such a value, the 32 bits are split between three components:
23 bits to store the significand
8 bits to store the exponent
1 bit to store the sign.
With 23 stored significand bits (plus the implied leading bit), the spacing between adjacent floats reaches 1 once the magnitude reaches 2^23. As a result, single-precision floating point only has about 6-9 significant decimal digits.
If you have a floating point value at or above 2^23 (8388608), it will never have a fractional component, because the spacing between representable values there is already at least 1.
To help that sink in, consider the following code:
void Test()
{
    float test = 8388608.0F;   // 2^23: the first float whose spacing is 1
    while (test > 0.0F)
    {
        test -= 0.1F;          // the subtraction rounds right back to 8388608.0F
    }
}
That code never terminates. Every time we try to decrement test by 0.1, the change in magnitude is lost because we don't have the precision to store it, so the value ends up right back at 8388608.0. No progress can ever be made, so it never terminates. This is true of all limited precision floating point types, so you'd find that this same problem would happen for IEEE 754 double precision floating point types (64-bit) all the same, just at a different, larger value.
Also, if your goal is to preserve as much precision as possible, then it does not make sense to cast from double to float. double is a 64-bit floating point type; float is a 32-bit floating point type. If you used double, you might be able to avoid most of the truncation if your values are small enough.
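A sketch of keeping the value as a double and asking the stream for enough digits (max_digits10 is the precision that guarantees a round-trippable decimal form):
#include <iomanip>
#include <iostream>
#include <limits>
#include <sstream>

int main()
{
    double value = 8388608.25;   // the fraction survives in a double
    std::stringstream stream;
    stream << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
    std::cout << stream.str() << '\n';   // prints 8388608.25
    return 0;
}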

Check if float can be represented as integral type

How do I check whether a float can be represented as an integral type without invoking undefined behavior by just casting? This is forbidden by §4.9.1:
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.
There's this question for C, but the accepted answer clearly causes undefined behavior (first by just plain casting and the later by using the union hack, which makes the whole thing very questionable to me).
I can see how it'd be hard to have a fully compliant solution, but one that is implementation defined (to assume IEEE-754 floats) would be acceptable too.
Check std::trunc(x) == x (the function is in <cmath>; the C equivalent truncf is in <math.h>). This will compare true if and only if x has no fractional part. Then, compare x to the range of the type.
Example Code (Untested)
#include <cfenv>
#include <cmath>
#include <limits>
#pragma STDC FENV_ACCESS on

template<class F, class N> // F is a float type, N integral.
bool is_representable(const F x)
{
    const int orig_rounding = std::fegetround();
    std::fesetround(FE_TOWARDZERO);
    const bool to_return = std::trunc(x) == x &&
                           x >= std::numeric_limits<N>::min() &&
                           x <= std::numeric_limits<N>::max();
    std::fesetround(orig_rounding);
    return to_return;
}
With rounding set toward zero, the implicit conversion of the minimum and maximum values of the integral type to the floating-point type should not overflow. On many architectures, including i386, casting to long double will also provide enough precision to exactly represent a 64-bit int.
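Hypothetical usage of the sketch above (the template arguments must be spelled out, since N cannot be deduced from the call):
#include <cstdint>

bool a = is_representable<float, int32_t>(1048576.0f);  // true: 2^20 is integral and in range
bool b = is_representable<float, int32_t>(1.5f);        // false: fractional part
bool c = is_representable<float, int32_t>(3e9f);        // false: exceeds INT32_MAX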
You could use the %a format specifier in snprintf() to get access to the mantissa and the exponent, and you can then work out whether the number is an integer and whether it can fit into an integer type of a specific size. I used this method in a different problem.
Let's consider positive integers from 1 to FLT_MAX. For IEEE 754, the exponent represents the power of 2 (it is stored with a bias).
The IEEE 754 single format uses 1 sign bit, 8 biased exponent bits and 23 mantissa bits. The mantissa is the fractional part of the number: the highest bit is worth 1/2, the next 1/4, the next 1/8, and so on.
Numbers from 1 to 1.999... have the same exponent. There is 1 integer in the range (1).
Numbers from 2 to 3.999... have the same exponent. There are 2 integers in the range (2, 3). 3 requires the leading mantissa bit; if any lower mantissa bits are set, it is not an integer, because the value would be 2 or 3 plus the value of a fractional bit.
Numbers from 4 to 7.999... have the same exponent. There are 4 integers in the range (4, 5, 6, 7). These use the 2 highest mantissa bits; if any other mantissa bits are set, it is not an integer.
Numbers from 8 to 15.999... have the same exponent. There are 8 integers in the range (8, 9, 10, 11, 12, 13, 14, 15). These use the 3 highest mantissa bits; if any other mantissa bits are set, it is not an integer.
I hope you can see the pattern: each time the exponent grows by one, the number of possible integers doubles. So, for unbiased exponent n, ignore the n highest mantissa bits and test whether any of the lower 23 - n bits are set. If they are, the number is not an integer.
This table shows the constant high byte 0x40 plus the next highest bit (together, the biased exponent), and that only the high-order mantissa bits are set for integers:
Float  Hex
4      0x40800000
5      0x40a00000
6      0x40c00000
7      0x40e00000
To read a float's bits as a 32-bit unsigned integer, use memcpy; a pointer cast breaks strict aliasing and is undefined behavior in C++:
#include <cstdint>
#include <cstring>

float x = 7.0f;
std::uint32_t i;
std::memcpy(&i, &x, sizeof i);   // copies the bit pattern safely
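Putting the whole scheme together, a sketch of the test described above (assuming IEEE 754 float; the function name is illustrative):
#include <cstdint>
#include <cstring>

bool is_integer_float(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);              // safe type pun
    int exponent = (int)((bits >> 23) & 0xFF) - 127;  // unbiased exponent
    std::uint32_t mantissa = bits & 0x7FFFFFu;
    if (exponent == 128) return false;                // infinity or NaN
    if (exponent < 0)    return x == 0.0f;            // |x| < 1: only zero is integral
    if (exponent >= 23)  return true;                 // spacing >= 1: always integral
    // Up to 'exponent' high mantissa bits may be set; the lower bits must be clear.
    return (mantissa & ((1u << (23 - exponent)) - 1)) == 0;
}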