Conversion between 4-byte IBM floating-point and IEEEE [duplicate]

Conversion between 4-byte IBM floating-point and IEEEE [duplicate] - c++

I need to read values from a binary file. The data format is IBM single Precision Floating Point (4-byte Hexadecimal Exponent Data). I have C++ code that reads from the file and takes out each byte and stores it like so
unsigned char buf[BUF_LEN];
for (long position = 0; position < fileLength; position += BUF_LEN) {
file.read((char* )(&buf[0]), BUF_LEN);
// printf("\n%8ld: ", pos);
for (int byte = 0; byte < BUF_LEN; byte++) {
// printf(" 0x%-2x", buf[byte]);
}
}
This prints out the hexadecimal values of each byte.
this picture specifies IBM single precision floating point
IBM single precision floating point
How do I convert the buffer into floating point values?

The format is actually quite simple, and not particularly different than IEEE 754 binary32 format (it's actually simpler, not supporting any of the "magic" NaN/Inf values, and having no subnormal numbers, because the mantissa here has an implicit 0 on the left instead of an implicit 1).
As Wikipedia puts it,
The number is represented as the following formula: (−1)sign × 0.significand × 16exponent−64.
If we imagine that the bytes you read are in a uint8_t b[4], then the resulting value should be something like:
uint32_t mantissa = (b[1]<<16) | (b[2]<<8) | b[3];
int exponent = (b[0] & 127) - 64;
double ret = mantissa * exp2(-24 + 4*exponent);
if(b[0] & 128) ret *= -1.;
Notice that here I calculated the result in a double, as the range of a IEEE 754 float is not enough to represent the same-sized IBM single precision value (also the opposite holds). Also, keep in mind that, due to endian issues, you may have to revert the indexes in my code above.
Edit: #Eric Postpischil correctly points out that, if you have C99 or POSIX 2001 available, instead of mantissa * exp2(-24 + 4*exponent) you should use ldexp(mantissa, -24 + 4*exponent), which should be more precise (and possibly faster) across implementations.

Related

Consvertion from int to float and back gives different results C++ [duplicate]

Consider the following code, which is an SSCCE of my actual problem:
#include <iostream>
int roundtrip(int x)
{
return int(float(x));
}
int main()
{
int a = 2147483583;
int b = 2147483584;
std::cout << a << " -> " << roundtrip(a) << '\n';
std::cout << b << " -> " << roundtrip(b) << '\n';
}
The output on my computer (Xubuntu 12.04.3 LTS) is:
2147483583 -> 2147483520
2147483584 -> -2147483648
Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...
Hm, on ideone, the output is different:
2147483583 -> 2147483520
2147483584 -> 2147483647
Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?

Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.
The float value nearest to 2147483584 is 231 exactly (the conversion from integer to floating-point usually rounds to the nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating-point is implementation-defined, most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is to round to the nearest).
Then, while converting from the float representing 231 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttsd2si typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.
You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.

Pascal's answer is OK - but lacks details which entails that some users do not get it ;-) . If you are interested in how it looks on lower level (assuming coprocessor and not software handles floating point operations) - read on.
In 32 bits of float (IEEE 754) you can store all of integers from within [-224...224] range. Integers outside the range may also have exact representation as float but not all of them have. The problem is that you can have only 24 significant bits to play with in float.
Here is how conversion from int->float typically looks like on low level:
fild dword ptr[your int]
fstp dword ptr[your float]
It is just sequence of 2 coprocessor instructions. First loads 32bit int onto comprocessor's stack and converts it into 80 bit wide float.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
(PROGRAMMING WITH THE X87 FPU):
When floating-point, integer, or packed BCD integer
values are loaded from memory into any of the x87 FPU data registers, the values are
automatically converted into double extended-precision floating-point format (if they
are not already in that format).
Since FPU registers are 80bit wide floats - there is no issue with fild here as 32bit int perfectly fits in 64bit significand of floating point format.
So far so good.
The second part - fstp is bit tricky and may be surprising. It is supposed to store 80bit floating point in 32bit float. Although it is all about integer values (in the question) coprocessor may actually perform 'rounding'. Ke? How do you round integer value even if it is stored in floating point format? ;-).
I'll explain it shortly - let's first see what rounding modes x87 provides (they are IEE 754 rounding modes' incarnation). X87 fpu has 4 rounding modes controlled by bits #10 and #11 of fpu's control word:
00 - to nearest even - Rounded result is the closest to the infinitely precise result. If two
values are equally close, the result is the even value (that is, the
one with the least-significant bit of zero). Default
01 - toward -Inf
10 - toward +inf
11 - toward 0 (ie. truncate)
You can play with rounding modes using this simple code (although it may be done differently - showing low level here):
enum ROUNDING_MODE
{
RM_TO_NEAREST = 0x00,
RM_TOWARD_MINF = 0x01,
RM_TOWARD_PINF = 0x02,
RM_TOWARD_ZERO = 0x03 // TRUNCATE
};
void set_round_mode(enum ROUNDING_MODE rm)
{
short csw;
short tmp = rm;
_asm
{
push ax
fstcw [csw]
mov ax, [csw]
and ax, ~(3<<10)
shl [tmp], 10
or ax, tmp
mov [csw], ax
fldcw [csw]
pop ax
}
}
Ok nice but still how is that related to integer values? Patience ... to understand why you might need rounding modes involved in int to float conversion check most obvious way of converting int to float - truncation (not default) - that may look like this:
record sign
negate your int if less than zero
find position of leftmost 1
shift int to the right/left so that 1 found above is positioned on bit #23
record number of shifts during the process so that you can calculate exponent
And the code simulating this bahavior may look like this:
float int2float(int value)
{
// handles all values from [-2^24...2^24]
// outside this range only some integers may be represented exactly
// this method will use truncation 'rounding mode' during conversion
// we can safely reinterpret it as 0.0
if (value == 0) return 0.0;
if (value == (1U<<31)) // ie -2^31
{
// -(-2^31) = -2^31 so we'll not be able to handle it below - use const
value = 0xCF000000;
return *((float*)&value);
}
int sign = 0;
// handle negative values
if (value < 0)
{
sign = 1U << 31;
value = -value;
}
// although right shift of signed is undefined - all compilers (that I know) do
// arithmetic shift (copies sign into MSB) is what I prefer here
// hence using unsigned abs_value_copy for shift
unsigned int abs_value_copy = value;
// find leading one
int bit_num = 31;
int shift_count = 0;
for(; bit_num > 0; bit_num--)
{
if (abs_value_copy & (1U<<bit_num))
{
if (bit_num >= 23)
{
// need to shift right
shift_count = bit_num - 23;
abs_value_copy >>= shift_count;
}
else
{
// need to shift left
shift_count = 23 - bit_num;
abs_value_copy <<= shift_count;
}
break;
}
}
// exponent is biased by 127
int exp = bit_num + 127;
// clear leading 1 (bit #23) (it will implicitly be there but not stored)
int coeff = abs_value_copy & ~(1<<23);
// move exp to the right place
exp <<= 23;
int ret = sign | exp | coeff;
return *((float*)&ret);
}
Now example - truncation mode converts 2147483583 to 2147483520.
2147483583 = 01111111_11111111_11111111_10111111
During int->float conversion you must shift leftmost 1 to bit #23. Now leading 1 is in bit#30. In order to place it in bit #23 you must perform right shift by 7 positions. During that you loose (they will not fit in 32bit float format) 7 lsb bits from the right (you truncate/chop). They were:
01111111 = 63
And 63 is what original number lost:
2147483583 -> 2147483520 + 63
Truncating is easy but may not necessarily be what you want and/or is best for all cases. Consider below example:
67108871 = 00000100_00000000_00000000_00000111
Above value cannot be exactly represented by float but check what truncation does to it. As previously - we need to shift leftmost 1 to bit #23. This requires value to be shifted right exactly 3 positions loosing 3 LSB bits (as of now I'll write numbers differently showing where implicit 24th bit of float is and will bracket explicit 23bits of significand):
00000001.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Truncation chops 3 trailing bits leaving us with 67108864 (67108864+7(3 chopped bits)) = 67108871 (remember although we shift we compensate with exponent manipulation - omitted here).
Is that good enough? Hey 67108872 is perfectly representable by 32bit float and should be much better than 67108864 right? CORRECT and this is where you might want to talk about rounding when converting int to 32bit float.
Now let's see how default 'rounding to nearest even' mode works and what are its implications in OP's case. Consider the same example one more time.
67108871 = 00000100_00000000_00000000_00000111
As we know we need 3 right shifts to place leftmost 1 in bit #23:
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Procedure of 'rounding to nearest even' involves finding 2 numbers that bracket input value 67108871 from bottom and above as close as possible. Keep in mind that we still operate within FPU on 80bits so although I show some bits being shifted out they are still in FPU reg but will be removed during rounding operation when storing output value.
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:
from top:
00000000_1.[0000000_00000000_00000000] 111 * 2^26
+1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
and from below:
00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
Obviously 67108872 is much closer to 67108871 than 67108864 hence conversion from 32bit int value 67108871 gives 67108872 (in rounding to nearest even mode).
Now OP's numbers (still rounding to nearest even):
2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
bracket values:
top:
00000000_1.[1111111_111111111_11111111] 0111111 * 2^30
+1
= 00000000_10.[0000000_00000000_00000000] * 2^30
= 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom:
00000000_1.[1111111_111111111_11111111] * 2^30 = 2147483520
Keep in mind that even word in 'rounding to nearest even' matters only when input value is halfway between bracket values. Only then word even matters and 'decides' which bracket value should be selected. In the above case even does not matter and we must simply choose nearer value, which is 2147483520
Last OP's case shows the problem where even word matters. :
2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
bracket values are the same as previously:
top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom: 00000000_1.[1111111_111111111_11111111] * 2^30 = 2147483520
There is no nearer value now (2147483648-2147483584=64=2147483584-2147483520) so we must rely on even and select top (even) value 2147483648.
And here OP's problem is that Pascal had briefly described. FPU works only on signed values and 2147483648 cannot be stored as signed int as its max value is 2147483647 hence issues.
Simple proof (without documentation quotes) that FPU works only on signed values ie. treats every value as signed is by debugging this:
unsigned int test = (1u << 31);
_asm
{
fild [test]
}
Although it looks like test value should be treated as unsigned it will be loaded as -231 as there is no separate instructions for loading signed and unsigned values into FPU. Likewise you'll not find instructions that will allow you to store unsigned value from FPU to mem. Everything is just a bit pattern treated as signed regardless of how you might have declared it in your program.
Was long but hope someone will learn something out of it.

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating point to an integer simply strips the fractional point of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceif(foo))?

For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•be, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•be for a non-negative e is an integer). If so, then M•b0 is an integer larger than M•be, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.

Floating point numbers are represented by 3 integers, cbq where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by it's mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-103,845,937,170,696,552,570,609,926,584,40,191, 103,845,937,170,696,552,570,609,926,584,40,191]
[source]
c++ provides digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing, it's cause of the bias introduced here; logically q = -6 but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10 again the above rules are really only required for base-2 but I've shown them as they would apply to base-10 for the purpose of explaination
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here, is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits continuing to increase it only introduces trailing zeros to the resulting number, thus calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL. And since we know the MSB of c will be used in the number this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095

sign changes when going from int to float and back

Consider the following code, which is an SSCCE of my actual problem:
#include <iostream>
int roundtrip(int x)
{
return int(float(x));
}
int main()
{
int a = 2147483583;
int b = 2147483584;
std::cout << a << " -> " << roundtrip(a) << '\n';
std::cout << b << " -> " << roundtrip(b) << '\n';
}
The output on my computer (Xubuntu 12.04.3 LTS) is:
2147483583 -> 2147483520
2147483584 -> -2147483648
Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...
Hm, on ideone, the output is different:
2147483583 -> 2147483520
2147483584 -> 2147483647
Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?

Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.
The float value nearest to 2147483584 is 231 exactly (the conversion from integer to floating-point usually rounds to the nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating-point is implementation-defined, most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is to round to the nearest).
Then, while converting from the float representing 231 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttsd2si typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.
You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.

Pascal's answer is OK - but lacks details which entails that some users do not get it ;-) . If you are interested in how it looks on lower level (assuming coprocessor and not software handles floating point operations) - read on.
In 32 bits of float (IEEE 754) you can store all of integers from within [-224...224] range. Integers outside the range may also have exact representation as float but not all of them have. The problem is that you can have only 24 significant bits to play with in float.
Here is how conversion from int->float typically looks like on low level:
fild dword ptr[your int]
fstp dword ptr[your float]
It is just sequence of 2 coprocessor instructions. First loads 32bit int onto comprocessor's stack and converts it into 80 bit wide float.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
(PROGRAMMING WITH THE X87 FPU):
When floating-point, integer, or packed BCD integer
values are loaded from memory into any of the x87 FPU data registers, the values are
automatically converted into double extended-precision floating-point format (if they
are not already in that format).
Since FPU registers are 80bit wide floats - there is no issue with fild here as 32bit int perfectly fits in 64bit significand of floating point format.
So far so good.
The second part - fstp is bit tricky and may be surprising. It is supposed to store 80bit floating point in 32bit float. Although it is all about integer values (in the question) coprocessor may actually perform 'rounding'. Ke? How do you round integer value even if it is stored in floating point format? ;-).
I'll explain it shortly - let's first see what rounding modes x87 provides (they are IEE 754 rounding modes' incarnation). X87 fpu has 4 rounding modes controlled by bits #10 and #11 of fpu's control word:
00 - to nearest even - Rounded result is the closest to the infinitely precise result. If two
values are equally close, the result is the even value (that is, the
one with the least-significant bit of zero). Default
01 - toward -Inf
10 - toward +inf
11 - toward 0 (ie. truncate)
You can play with rounding modes using this simple code (although it may be done differently - showing low level here):
enum ROUNDING_MODE
{
RM_TO_NEAREST = 0x00,
RM_TOWARD_MINF = 0x01,
RM_TOWARD_PINF = 0x02,
RM_TOWARD_ZERO = 0x03 // TRUNCATE
};
void set_round_mode(enum ROUNDING_MODE rm)
{
short csw;
short tmp = rm;
_asm
{
push ax
fstcw [csw]
mov ax, [csw]
and ax, ~(3<<10)
shl [tmp], 10
or ax, tmp
mov [csw], ax
fldcw [csw]
pop ax
}
}
Ok nice but still how is that related to integer values? Patience ... to understand why you might need rounding modes involved in int to float conversion check most obvious way of converting int to float - truncation (not default) - that may look like this:
record sign
negate your int if less than zero
find position of leftmost 1
shift int to the right/left so that 1 found above is positioned on bit #23
record number of shifts during the process so that you can calculate exponent
And the code simulating this bahavior may look like this:
float int2float(int value)
{
// handles all values from [-2^24...2^24]
// outside this range only some integers may be represented exactly
// this method will use truncation 'rounding mode' during conversion
// we can safely reinterpret it as 0.0
if (value == 0) return 0.0;
if (value == (1U<<31)) // ie -2^31
{
// -(-2^31) = -2^31 so we'll not be able to handle it below - use const
value = 0xCF000000;
return *((float*)&value);
}
int sign = 0;
// handle negative values
if (value < 0)
{
sign = 1U << 31;
value = -value;
}
// although right shift of signed is undefined - all compilers (that I know) do
// arithmetic shift (copies sign into MSB) is what I prefer here
// hence using unsigned abs_value_copy for shift
unsigned int abs_value_copy = value;
// find leading one
int bit_num = 31;
int shift_count = 0;
for(; bit_num > 0; bit_num--)
{
if (abs_value_copy & (1U<<bit_num))
{
if (bit_num >= 23)
{
// need to shift right
shift_count = bit_num - 23;
abs_value_copy >>= shift_count;
}
else
{
// need to shift left
shift_count = 23 - bit_num;
abs_value_copy <<= shift_count;
}
break;
}
}
// exponent is biased by 127
int exp = bit_num + 127;
// clear leading 1 (bit #23) (it will implicitly be there but not stored)
int coeff = abs_value_copy & ~(1<<23);
// move exp to the right place
exp <<= 23;
int ret = sign | exp | coeff;
return *((float*)&ret);
}
Now example - truncation mode converts 2147483583 to 2147483520.
2147483583 = 01111111_11111111_11111111_10111111
During int->float conversion you must shift leftmost 1 to bit #23. Now leading 1 is in bit#30. In order to place it in bit #23 you must perform right shift by 7 positions. During that you loose (they will not fit in 32bit float format) 7 lsb bits from the right (you truncate/chop). They were:
01111111 = 63
And 63 is what original number lost:
2147483583 -> 2147483520 + 63
Truncating is easy but may not necessarily be what you want and/or is best for all cases. Consider below example:
67108871 = 00000100_00000000_00000000_00000111
Above value cannot be exactly represented by float but check what truncation does to it. As previously - we need to shift leftmost 1 to bit #23. This requires value to be shifted right exactly 3 positions loosing 3 LSB bits (as of now I'll write numbers differently showing where implicit 24th bit of float is and will bracket explicit 23bits of significand):
00000001.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Truncation chops 3 trailing bits leaving us with 67108864 (67108864+7(3 chopped bits)) = 67108871 (remember although we shift we compensate with exponent manipulation - omitted here).
Is that good enough? Hey 67108872 is perfectly representable by 32bit float and should be much better than 67108864 right? CORRECT and this is where you might want to talk about rounding when converting int to 32bit float.
Now let's see how default 'rounding to nearest even' mode works and what are its implications in OP's case. Consider the same example one more time.
67108871 = 00000100_00000000_00000000_00000111
As we know we need 3 right shifts to place leftmost 1 in bit #23:
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Procedure of 'rounding to nearest even' involves finding 2 numbers that bracket input value 67108871 from bottom and above as close as possible. Keep in mind that we still operate within FPU on 80bits so although I show some bits being shifted out they are still in FPU reg but will be removed during rounding operation when storing output value.
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:
from top:
00000000_1.[0000000_00000000_00000000] 111 * 2^26
+1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
and from below:
00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
Obviously 67108872 is much closer to 67108871 than 67108864 hence conversion from 32bit int value 67108871 gives 67108872 (in rounding to nearest even mode).
Now OP's numbers (still rounding to nearest even):
2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
bracket values:
top:
00000000_1.[1111111_111111111_11111111] 0111111 * 2^30
+1
= 00000000_10.[0000000_00000000_00000000] * 2^30
= 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom:
00000000_1.[1111111_111111111_11111111] * 2^30 = 2147483520
Keep in mind that even word in 'rounding to nearest even' matters only when input value is halfway between bracket values. Only then word even matters and 'decides' which bracket value should be selected. In the above case even does not matter and we must simply choose nearer value, which is 2147483520
Last OP's case shows the problem where even word matters. :
2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
bracket values are the same as previously:
top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom: 00000000_1.[1111111_111111111_11111111] * 2^30 = 2147483520
There is no nearer value now (2147483648-2147483584=64=2147483584-2147483520) so we must rely on even and select top (even) value 2147483648.
And here OP's problem is that Pascal had briefly described. FPU works only on signed values and 2147483648 cannot be stored as signed int as its max value is 2147483647 hence issues.
Simple proof (without documentation quotes) that FPU works only on signed values ie. treats every value as signed is by debugging this:
unsigned int test = (1u << 31);
_asm
{
fild [test]
}
Although it looks like test value should be treated as unsigned it will be loaded as -231 as there is no separate instructions for loading signed and unsigned values into FPU. Likewise you'll not find instructions that will allow you to store unsigned value from FPU to mem. Everything is just a bit pattern treated as signed regardless of how you might have declared it in your program.
Was long but hope someone will learn something out of it.

Checksum of floats with roundtrip through text file

I need to write a couple of floats to a text file and store a CRC32 checksum with them. Then when I read the floats back from the text file, I want to recompute the checksum and compare it to the one that was previously computed when saving the file. My problem is that the checksum sometimes fails. This is due to the fact that equal floating point numbers can be represented by different bit patterns. For completeness' sake, I will summarize the code in the next paragraphs.
I have adapted this CRC32 algorithm which I found after reading this question. Here's what it looks like:
uint32_t updC32(uint32_t octet, uint32_t crc) {
return CRC32Tab[(crc ^ octet) & 0xFF] ^ (crc >> 8);
}
template <typename T>
uint32_t updateCRC32(T s, uint32_t crc) {
const char* buf = reinterpret_cast<const char*>(&s);
size_t len = sizeof(T);
for (; len; --len, ++buf)
crc = updC32(static_cast<uint32_t>(*buf), crc);
return crc;
}
CRC32Tab contains exactly the same values as the large array in the file linked above.
This is an abbreviated version of how I write the floats to a file and compute the checksum:
float x, y, z;
// set them to some values
uint32_t crc = 0xFFFFFFFF;
crc = Utility::updateCRC32(x, crc);
crc = Utility::updateCRC32(y, crc);
crc = Utility::updateCRC32(z, crc);
const uint32_t actualCrc = ~crc;
// stream is a FILE pointer, and I don't mind the scientific representation
fprintf(stream, " ( %g %g %g )", x, y, z);
fprintf(stream, " CRC %u\n", actualCrc);
I read the values back from the file as follows. There is actually a lot more involved as the file has a more complex syntax and has to be parsed, but let's assume that getNextFloat() returns the textual representation of each float written before.
float x = std::atof(getNextFloat());
float y = std::atof(getNextFloat());
float z = std::atof(getNextFloat());
uint32_t crc = 0xFFFFFFFF;
crc = Utility::updateCRC32(x, crc);
crc = Utility::updateCRC32(y, crc);
crc = Utility::updateCRC32(z, crc);
const uint32_t actualCrc = ~crc;
const uint32_t fileCrc = // read the CRC from the file
assert(fileCrc == actualCrc); // fails often, but not always
The source of this problem to be that std::atof will return a different bit representation of the float encoded in the string which was read from the file than the bit representation of the float that was used to write that string to the file.
So, my question is: Is there another way to achieve my goal of checksumming floats which are roundtripped through a textual representation other than to checksum the strings themselves?
Thanks for reading!

The source of the issue is apparent from your comment:
If I'm not completely mistaken, there is no rounding happening here. The %g specifier chooses the shortest string representation that exactly represents the number.
This is incorrect. If no precision is specified, it defaults to 6, and rounding will definitely occur for most floating-point inputs.
If you need a human-readable round-trippable format, %a is by far the best-choice. Failing that, you will need to specify a precision of at least 9 (assuming that float on your system is IEEE-754 single precision).
You may still be tripped up by NaN encodings, since the standard does not specify how or if they must be printed.

If the text file doesn't have to be human-readable, use hexadecimal float literals instead, they are exact so you won't have this problem of differences between textual and in-memory values.

If your standard library's float-to-text and text-to-float conversions do proper rounding, you just need enough sigificant digits for the float->text->float roundtrip to be lossless unless you also have Infs and NaNs, still it should be "value-preserving", not necessarily bitpattern preserving since there are multiple representations for infinity or NaN, I think. For an IEEE-754 64 bit double 17 significant digits is just enough to make the roundtrip lossless with respect to the actual value.

Your CRC algorithm is flawed for any type which has multiple binary representations for a single value. IEEE 754 has two representations for 0.0, to wit +0.0 and -0.0. Other, non-finite values such as NaN are potentially troublesome too.

Would it be acceptable to canonicalize your numbers before you update the CRC? So while saving, you would get a temporary string version of your number (with sprintf or whatever matches your serialization's format), then convert this string back to a numeric value, and then use this result to update the CRC. This way, you know that the CRC will match the deserialized value.

Double to int conversion behind the scene?

I am just curious to know what happens behind the scene to convert a double to int, say int(5666.1) ? Is that going to be more expensive than a static_cast of a child class to parent? Since the representation of the int and double are fundamentally different is there going to be temporaries created during the process and expensive too.

Any CPU with native floating point will have an instruction to convert floating-point to integer data. That operation can take from a few cycles to many. Usually there are separate CPU registers for FP and integers, so you also have to subsequently move the integer to an integer register before you can use it. That may be another operation, possibly expensive. See your processor manual.
PowerPC notably does not include an instruction to move an integer in an FP register to an integer register. There has to be a store from FP to memory and load to integer. You could therefore say that a temporary variable is created.
In the case of no hardware FP support, the number has to be decoded. IEEE FP format is:
sign | exponent + bias | mantissa
To convert, you have to do something like
// Single-precision format values:
int const mantissa_bits = 23; // 52 for double.
int const exponent_bits = 8; // 11 for double.
int const exponent_bias = 127; // 1023 for double.
std::int32_t ieee;
std::memcpy( & ieee, & float_value, sizeof (std::int32_t) );
std::int32_t mantissa = ieee & (1 << mantissa_bits)-1 | 1 << mantissa_bits;
int exponent = ( ieee >> mantissa_bits & (1 << exponent_bits)-1 )
- ( exponent_bias + mantissa_bits );
if ( exponent <= -32 ) {
mantissa = 0;
} else if ( exponent < 0 ) {
mantissa >>= - exponent;
} else if ( exponent + mantissa_bits + 1 >= 32 ) {
overflow();
} else {
mantissa <<= exponent;
}
if ( ieee < 0 ) mantissa = - mantissa;
return mantissa;
I.e., a few bit unpacking instructions and a shift.

There's invariably a dedicated FPU instruction that gets the job done, cvttsd2si if the code generator uses the Intel SSE2 instruction set. That's fast, but not as fast as a static cast. That doesn't usually require any code at all.

The static_cast is dependent on the compiler's C++ code generation, but generally has no runtime cost, as the pointer change is calculated at compile time based on the assumed information in the cast.
When you convert a double to int, on an x86 system the compiler will generate a FIST (Floating-Point/Integer Conversion) instruction, and the FPU will do the conversion. This conversion can be implemented in software, and is done this way on certain hardware, or if the program requires it. The GNU MPFR library is capable of doing double to int conversions, and will perform the same conversion on all hardware.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Conversion between 4-byte IBM floating-point and IEEEE [duplicate] - c++

Related

Consvertion from int to float and back gives different results C++ [duplicate]

When Will static_casting the Result of ceil Compromise the Result?

sign changes when going from int to float and back

Checksum of floats with roundtrip through text file

Double to int conversion behind the scene?

Categories

Resources