Conversion from int to float and back gives different results C++ [duplicate] - c++

Consider the following code, which is an SSCCE of my actual problem:
#include <iostream>
int roundtrip(int x)
{
return int(float(x));
}
int main()
{
int a = 2147483583;
int b = 2147483584;
std::cout << a << " -> " << roundtrip(a) << '\n';
std::cout << b << " -> " << roundtrip(b) << '\n';
}
The output on my computer (Xubuntu 12.04.3 LTS) is:
2147483583 -> 2147483520
2147483584 -> -2147483648
Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...
Hm, on ideone, the output is different:
2147483583 -> 2147483520
2147483584 -> 2147483647
Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?

Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.
The float value nearest to 2147483584 is 2^31 exactly (the conversion from integer to floating-point usually rounds to the nearest representable value, which can be up, and is up in this case. To be precise, the behavior when converting from integer to floating-point is implementation-defined; most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is round to nearest).
Then, while converting from the float representing 2^31 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttsd2si typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.
You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.
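If you need the round trip to be well-defined, one option is to range-check before converting back. The helper below is not part of the answer, just a minimal sketch; its name and its policy (returning false on overflow) are mine:
#include <cmath>
// Range-checked float -> int conversion (illustrative sketch only).
// 2147483648.0f is exactly 2^31 and is representable as a float, whereas
// INT_MAX (2^31 - 1) is not, so we compare against the exclusive upper bound.
bool float_to_int_checked(float f, int& out)
{
    if (std::isnan(f)) return false;
    if (f >= 2147483648.0f || f < -2147483648.0f) return false; // would overflow int
    out = static_cast<int>(f); // now in range; truncates toward zero
    return true;
}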

Pascal's answer is OK, but it lacks details, which means some users may not get it ;-) . If you are interested in how this looks at a lower level (assuming the coprocessor, not software, handles the floating-point operations) - read on.
In the 32 bits of a float (IEEE 754) you can store every integer from the range [-2^24 ... 2^24]. Integers outside that range may also have an exact representation as a float, but not all of them do. The problem is that you have only 24 significant bits to play with in a float.
Here is what an int->float conversion typically looks like at a low level:
fild dword ptr[your int]
fstp dword ptr[your float]
It is just a sequence of 2 coprocessor instructions. The first loads the 32-bit int onto the coprocessor's stack and converts it into an 80-bit-wide float.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
(PROGRAMMING WITH THE X87 FPU):
When floating-point, integer, or packed BCD integer
values are loaded from memory into any of the x87 FPU data registers, the values are
automatically converted into double extended-precision floating-point format (if they
are not already in that format).
Since FPU registers are 80-bit-wide floats, there is no issue with fild here: a 32-bit int fits perfectly into the 64-bit significand of that floating-point format.
So far so good.
The second part - fstp - is a bit tricky and may be surprising. It is supposed to store an 80-bit floating-point value into a 32-bit float. Although the question is all about integer values, the coprocessor may actually perform 'rounding'. Eh? How do you round an integer value, even one stored in floating-point format? ;-).
I'll explain shortly - let's first see what rounding modes the x87 provides (they are the incarnation of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes controlled by bits #10 and #11 of the FPU's control word:
00 - to nearest even - the rounded result is the closest to the infinitely precise result; if two values are equally close, the result is the even value (that is, the one with a least-significant bit of zero). Default
01 - toward -Inf
10 - toward +Inf
11 - toward 0 (i.e. truncate)
You can play with the rounding modes using this simple code (it could be done differently - I'm showing the low level here):
enum ROUNDING_MODE
{
    RM_TO_NEAREST  = 0x00,
    RM_TOWARD_MINF = 0x01,
    RM_TOWARD_PINF = 0x02,
    RM_TOWARD_ZERO = 0x03 // TRUNCATE
};
void set_round_mode(enum ROUNDING_MODE rm)
{
    short csw;
    short tmp = rm;
    _asm
    {
        push  ax
        fstcw [csw]
        mov   ax, [csw]
        and   ax, ~(3<<10)
        shl   [tmp], 10
        or    ax, tmp
        mov   [csw], ax
        fldcw [csw]
        pop   ax
    }
}
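If you do not want to touch the control word with inline assembly, the rounding mode can also be changed through the standard <cfenv> interface. This is my own sketch, not part of the answer; whether an int -> float conversion honours the dynamic rounding mode (and whether fesetround is respected without FENV_ACCESS) is implementation-dependent, so the results shown are typical rather than guaranteed:
#include <cfenv>
#include <iostream>

int main()
{
    volatile int n = 2147483584;      // the OP's second value; volatile keeps the
                                      // conversion from being folded at compile time
    std::fesetround(FE_TONEAREST);    // default: round to nearest (even)
    float nearest = static_cast<float>(n);

    std::fesetround(FE_TOWARDZERO);   // truncate, as in the walkthrough below
    float truncated = static_cast<float>(n);

    std::fesetround(FE_TONEAREST);    // restore the default

    std::cout.precision(10);
    std::cout << nearest << '\n';     // typically 2147483648
    std::cout << truncated << '\n';   // typically 2147483520
}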
OK, nice, but how is that related to integer values? Patience ... to understand why rounding modes get involved in int-to-float conversion, look at the most obvious way of converting an int to a float - truncation (not the default) - which may look like this:
record sign
negate your int if less than zero
find position of leftmost 1
shift int to the right/left so that 1 found above is positioned on bit #23
record number of shifts during the process so that you can calculate exponent
And the code simulating this behavior may look like this:
float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion

    // an int 0 has an all-zero bit pattern, so we can safely return it as 0.0
    if (value == 0) return 0.0;
    if (value == (1U<<31)) // ie -2^31
    {
        // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
        value = 0xCF000000;
        return *((float*)&value);
    }

    int sign = 0;
    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // right shift of a negative signed value is implementation-defined - most compilers
    // do an arithmetic shift (copying the sign into the MSB) - so to stay well-defined
    // we use an unsigned abs_value_copy for the shifts
    unsigned int abs_value_copy = value;

    // find leading one
    int bit_num = 31;
    int shift_count = 0;
    for(; bit_num >= 0; bit_num--)
    {
        if (abs_value_copy & (1U<<bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // exponent is biased by 127
    int exp = bit_num + 127;
    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    int coeff = abs_value_copy & ~(1<<23);
    // move exp to the right place
    exp <<= 23;
    int ret = sign | exp | coeff;
    return *((float*)&ret);
}
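As a quick sanity check (my addition, assuming the int2float above is compiled as-is), you can compare it against the compiler's own conversion before walking through the examples below:
#include <iostream>

int main()
{
    // int2float truncates, so both of the OP's values map to 2147483520,
    // while the compiler's conversion (typically round to nearest) maps
    // 2147483584 up to 2147483648.
    std::cout << (int2float(2147483583) == 2147483520.0f) << '\n';            // 1
    std::cout << (int2float(2147483584) == 2147483520.0f) << '\n';            // 1
    std::cout << (static_cast<float>(2147483584) == 2147483648.0f) << '\n';   // typically 1
}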
Now an example: truncation mode converts 2147483583 to 2147483520.
2147483583 = 01111111_11111111_11111111_10111111
During the int->float conversion you must shift the leftmost 1 to bit #23. Here the leading 1 is in bit #30. In order to place it in bit #23 you must shift right by 7 positions. During that you lose the 7 LSBs from the right (they will not fit in the 32-bit float format) - you truncate/chop them. They were:
0111111 = 63
And 63 is exactly what the original number lost:
2147483583 -> 2147483520 + 63
Truncating is easy, but it is not necessarily what you want, nor the best choice in all cases. Consider the example below:
67108871 = 00000100_00000000_00000000_00000111
The value above cannot be represented exactly by a float, but look at what truncation does to it. As before, we need to shift the leftmost 1 to bit #23. This requires the value to be shifted right exactly 3 positions, losing the 3 LSBs (from now on I'll write the numbers differently, showing where the implicit 24th bit of the float is and bracketing the explicit 23 bits of the significand):
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Truncation chops the 3 trailing bits, leaving us with 67108864; the chopped bits were worth 7, since 67108864 + 7 = 67108871 (remember that although we shift, we compensate with the exponent - omitted here).
Is that good enough? Hey, 67108872 is perfectly representable by a 32-bit float and is a much better approximation than 67108864, right? CORRECT - and this is where rounding comes into play when converting an int to a 32-bit float.
Now let's see how the default 'round to nearest even' mode works and what its implications are in the OP's case. Consider the same example one more time.
67108871 = 00000100_00000000_00000000_00000111
As we know we need 3 right shifts to place leftmost 1 in bit #23:
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
The 'round to nearest even' procedure involves finding the 2 representable numbers that bracket the input value 67108871 from below and above as closely as possible. Keep in mind that we still operate on 80 bits within the FPU, so although I show some bits being shifted out, they are still in the FPU register; they are only removed by the rounding operation when the output value is stored.
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:
from top:
00000000_1.[0000000_00000000_00000000] 111 * 2^26
+1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
and from below:
00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
Obviously 67108872 is much closer to 67108871 than 67108864 is, hence converting the 32-bit int value 67108871 gives 67108872 (in round to nearest even mode).
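You can observe this directly (my own check, not part of the answer); with the usual round-to-nearest conversion this prints 67108872:
#include <iostream>

int main()
{
    volatile int n = 67108871;                    // not exactly representable in a float
    std::cout.precision(10);
    std::cout << static_cast<float>(n) << '\n';   // typically 67108872
}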
Now OP's numbers (still rounding to nearest even):
2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
bracket values:
top:
00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
+1
= 00000000_10.[0000000_00000000_00000000] * 2^30
= 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom:
00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
Keep in mind that the word even in 'round to nearest even' matters only when the input value is exactly halfway between the bracket values; only then does it 'decide' which bracket value is selected. In the case above, even does not come into play and we simply choose the nearer value, which is 2147483520.
The OP's last case shows the situation where the word even does matter:
2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
bracket values are the same as previously:
top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
There is no nearer value now (2147483648 - 2147483584 = 64 = 2147483584 - 2147483520), so we must rely on even and select the top (even) value, 2147483648.
And here is the OP's problem, which Pascal briefly described: the FPU works only on signed values, and 2147483648 cannot be stored in a signed int, whose maximum value is 2147483647 - hence the trouble.
A simple proof (without documentation quotes) that the FPU works only on signed values, i.e. treats every value as signed, is to debug this:
unsigned int test = (1u << 31);
_asm
{
fild [test]
}
Although it looks like the test value should be treated as unsigned, it will be loaded as -2^31, as there are no separate instructions for loading signed and unsigned values into the FPU. Likewise you will not find instructions that store an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, regardless of how you declared it in your program.
This was long, but I hope someone learns something from it.

Related

Conversion between 4-byte IBM floating-point and IEEE [duplicate]

I need to read values from a binary file. The data format is IBM single Precision Floating Point (4-byte Hexadecimal Exponent Data). I have C++ code that reads from the file and takes out each byte and stores it like so
unsigned char buf[BUF_LEN];
for (long position = 0; position < fileLength; position += BUF_LEN) {
    file.read((char*)(&buf[0]), BUF_LEN);
    // printf("\n%8ld: ", position);
    for (int byte = 0; byte < BUF_LEN; byte++) {
        // printf(" 0x%-2x", buf[byte]);
    }
}
This prints out the hexadecimal values of each byte.
This picture specifies the IBM single-precision floating-point format:
[Figure: IBM single-precision floating-point format]
How do I convert the buffer into floating point values?
The format is actually quite simple, and not particularly different from the IEEE 754 binary32 format (it's actually simpler, not supporting any of the "magic" NaN/Inf values, and having no subnormal numbers, because the mantissa here has an implicit 0 on the left instead of an implicit 1).
As Wikipedia puts it,
The number is represented as the following formula: (−1)^sign × 0.significand × 16^(exponent − 64).
If we imagine that the bytes you read are in a uint8_t b[4], then the resulting value should be something like:
uint32_t mantissa = (b[1]<<16) | (b[2]<<8) | b[3];
int exponent = (b[0] & 127) - 64;
double ret = mantissa * exp2(-24 + 4*exponent);
if(b[0] & 128) ret *= -1.;
Notice that here I calculated the result in a double, as the range of an IEEE 754 float is not enough to represent every same-sized IBM single-precision value (and the opposite also holds). Also, keep in mind that, due to endianness issues, you may have to reverse the indexes in my code above.
Edit: @Eric Postpischil correctly points out that, if you have C99 or POSIX 2001 available, instead of mantissa * exp2(-24 + 4*exponent) you should use ldexp(mantissa, -24 + 4*exponent), which should be more precise (and possibly faster) across implementations.
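Putting the pieces together (my own assembly of the snippets above, with a function name of my choosing), a self-contained version using ldexp might look like this; it assumes b holds the 4 bytes in the big-endian order shown in the format description:
#include <cmath>
#include <cstdint>

// IBM single-precision (hex float) bits -> double, as sketched above.
double ibm32_to_double(const uint8_t b[4])
{
    uint32_t mantissa = (uint32_t(b[1]) << 16) | (uint32_t(b[2]) << 8) | b[3];
    int exponent = (b[0] & 127) - 64;                  // base-16 exponent, bias 64
    double ret = std::ldexp(double(mantissa), -24 + 4 * exponent);
    return (b[0] & 128) ? -ret : ret;                  // top bit is the sign
}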

Why does shifting 0xff left by 24 bits result in an incorrect value?

I would like to shift 0xff left by 3 bytes and store it in a uint64_t, which should work as such:
uint64_t temp = 0xff << 24;
This yields a value of 0xffffffffff000000 which is most definitely not the expected 0xff000000.
However, if I shift it by fewer than 3 bytes, it results in the correct answer.
Furthermore, trying to shift 0x01 left by 3 bytes does work.
Here's my output:
0xff shifted by 0 bytes: 0xff
0x01 shifted by 0 bytes: 0x1
0xff shifted by 1 bytes: 0xff00
0x01 shifted by 1 bytes: 0x100
0xff shifted by 2 bytes: 0xff0000
0x01 shifted by 2 bytes: 0x10000
0xff shifted by 3 bytes: 0xffffffffff000000
0x01 shifted by 3 bytes: 0x1000000
With some experimentation, left shifting by up to 3 bytes works for every value up to 0x7f, which yields 0x7f000000; 0x80 yields 0xffffffff80000000.
Does anyone have an explanation for this bizarre behavior? 0xff000000 certainly falls within the 2^64 - 1 limits of uint64_t.
Does anyone have an explanation for this bizarre behavior?
Yes: the type of an operation always depends on the operand types, never on the result type:
double r = 1.0 / 2.0;
// double divided by double and result double assigned to r
// r == 0.5
double r = 1.0 / 2;
// 2 converted to double, double divided by double and result double assigned to r
// r == 0.5
double r = 1 / 2;
// int divided by int, result int converted to double and assigned to r
// r == 0.0
Once you understand and remember this, you will not make this mistake again.
I suspect the behavior is compiler dependent, but I am seeing the same thing.
The fix is simple. Be sure to cast the 0xff to a uint64_t type BEFORE performing the shift. That way the compiler will handle it as the correct type.
uint64_t temp = uint64_t(0xff) << 24;
Shifting left creates a negative (32-bit) number which then gets sign-extended to 64 bits. Make the left operand 64 bits wide before shifting.
Try
0xffULL << 24;
Let's break your problem up into two pieces. The first is the shift operation, and the other is the conversion to uint64_t.
As far as the left shift is concerned, you are invoking undefined behavior when int is 32 bits (or smaller). As others have mentioned, the operands are int. A 32-bit int with the given value would be 0x000000ff. Note that this is a signed number, so the left-most bit is the sign. According to the standard, if the shift affects the sign bit, the result is undefined. It is up to the whims of the implementation, it is subject to change at any point, and it can even be completely optimized away if the compiler recognizes it at compile time. The latter is not realistic, but it is actually permitted. While you should never rely on code of this form, this is actually not the root of the behavior that puzzled you.
Now, for the second part. The undefined outcome of the left shift operation has to be converted to a uint64_t. The standard states for signed to unsigned integral conversions:
If the destination type is unsigned, the resulting value is the smallest unsigned value equal to the source value modulo 2n where n is the number of bits used to represent the destination type.
That is, depending on whether the destination type is wider or narrower, signed integers are sign-extended[footnote 1] or truncated and unsigned integers are zero-extended or truncated respectively.
The footnote clarifies that sign-extension is true only for two's-complement representation which is used on every platform with a C++ compiler currently.
Sign-extension simply means that everything to the left of the sign bit in the destination variable is filled with the sign bit, which produces all the f's in your result. As you noted, you could left shift 0x7f by 3 bytes without this occurring; that's because 0x7f = 0b01111111. After the shift you get 0x7f000000, the largest of these results that leaves the sign bit clear. Therefore, in the conversion, a 0 was extended.
Converting the left operand to a large enough type solves this.
uint64_t temp = uint64_t(0xff) << 24;
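To see both behaviours side by side, here is a small sketch of mine (the first initialiser deliberately reproduces the questionable shift, so expect a compiler warning):
#include <cstdint>
#include <iostream>

int main()
{
    uint64_t bad  = 0xff << 24;            // shift done in int, then sign-extended
    uint64_t good = uint64_t(0xff) << 24;  // operand widened first, 64-bit shift
    std::cout << std::hex
              << bad  << '\n'              // typically ffffffffff000000
              << good << '\n';             // ff000000
}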

sign changes when going from int to float and back


Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 111000000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
[Figure: IEEE-754 single-precision bit layout - sign | exponent | fraction]
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
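For reference, here is a small sketch of mine (not part of the answer) that extracts those three fields from a float's bit pattern; for 2.0f it prints 40000000 0 128 0, matching the walk-through above:
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    float f = 2.0f;
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);          // well-defined way to inspect the bits

    uint32_t sign        = bits >> 31;            // bit 31
    uint32_t exponent    = (bits >> 23) & 0xFF;   // bits 30..23, biased by 127
    uint32_t significand = bits & 0x7FFFFF;       // bits 22..0 (hidden 1 not stored)

    std::cout << std::hex << bits << std::dec << ' '
              << sign << ' ' << exponent << ' ' << significand << '\n';
}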
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0; // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
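For example, with the GCC/Clang builtin __builtin_clz (so not portable C++ - this is just my sketch of the idea), the loop collapses to a couple of arithmetic operations; the preconditions are the same as for uint_to_float above:
#include <cstdint>
#include <cstring>

float uint_to_float_clz(unsigned int value)   // requires 0 < value < 1 << 24
{
    int msb = 31 - __builtin_clz(value);               // position of the leading 1
    uint32_t significand = value << (23 - msb);        // align the leading 1 to bit 23
    uint32_t bits = (uint32_t(127 + msb) << 23)        // biased exponent
                  | (significand & 0x7FFFFF);          // strip the hidden 1
    float result;
    std::memcpy(&result, &bits, sizeof result);        // avoid the type-punned cast
    return result;
}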
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating-point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest even). But the fact of the matter is you can't shove more than 24 bits' worth of value into a 24-bit field without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
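As an illustration of that last point, here is a sketch of mine (again using the __builtin_clz builtin, and only handling 2^24 <= value < 2^31) that does the right shift with round-to-nearest-even, reproducing the results discussed further up the page (e.g. 2147483583 -> 2147483520 and 2147483584 -> 2147483648):
#include <cstdint>
#include <cstring>

float uint_to_float_round(uint32_t value)   // requires 1u << 24 <= value < 1u << 31
{
    int msb = 31 - __builtin_clz(value);
    int drop = msb - 23;                                 // LSBs that do not fit
    uint32_t kept      = value >> drop;
    uint32_t discarded = value & ((1u << drop) - 1);
    uint32_t halfway   = 1u << (drop - 1);
    // Round to nearest; on an exact tie the lowest kept bit (even/odd) decides.
    if (discarded > halfway || (discarded == halfway && (kept & 1)))
        ++kept;
    if (kept == (1u << 24)) { kept >>= 1; ++msb; }       // rounding carried into a new bit
    uint32_t bits = (uint32_t(127 + msb) << 23) | (kept & 0x7FFFFF);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}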

Why perform multiplication in this way?

I've run into this function:
static inline INT32 MPY48SR(INT16 o16, INT32 o32)
{
    UINT32 Temp0;
    INT32 Temp1;
    // A1. get the lower 16 bits of the 32-bit param
    // A2. multiply them with the 16-bit param
    // A3. add 16384 (TODO: why?)
    // A4. bitshift to the right by 15 (TODO: why 15?)
    Temp0 = (((UINT16)o32 * o16) + 0x4000) >> 15;
    // B1. Get the higher 16 bits of the 32-bit param
    // B2. Multiply them with the 16-bit param
    Temp1 = (INT16)(o32 >> 16) * o16;
    // 1. Shift B to the left (TODO: why do this?)
    // 2. Combine with A and return
    return (Temp1 << 1) + Temp0;
}
The inline comments are mine. It seems that all it's doing is multiplying the two arguments. Is this right, or is there more to it? Why would this be done in such a way?
Those parameters don't represent integers. They represent real numbers in fixed-point format with 15 bits to the right of the radix point. For instance, 1.0 is represented by 1 << 15 = 0x8000, 0.5 is 0x4000, -0.5 is 0xC000 (or 0xFFFFC000 in 32 bits).
Adding fixed-point numbers is simple, because you can just add their integer representations. But if you want to multiply, you first have to multiply them as integers, but then you have twice as many bits to the right of the radix point, so you have to discard the excess by shifting. For instance, if you want to multiply 0.5 by itself in 32-bit format, you multiply 0x00004000 (1 << 14) by itself to get 0x10000000 (1 << 28), then shift right by 15 bits to get 0x00002000 (1 << 13). To get better accuracy, when you discard the lowest 15 bits, you want to round to the nearest number, not round down. You can do this by adding 0x4000 = 1 << 14. Then if the discarded 15 bits are less than 0x4000, the value gets rounded down, and if they are 0x4000 or more, it gets rounded up.
(0x3FFF + 0x4000) >> 15 = 0x7FFF >> 15 = 0
(0x4000 + 0x4000) >> 15 = 0x8000 >> 15 = 1
To sum up, you can do the multiplication like this:
return (o32 * o16 + 0x4000) >> 15;
But there's a problem. In C++, the result of a multiplication has the same type as its operands. So o16 is promoted to the same size as o32, then they are multiplied to get a 32-bit result. But this throws away the top bits, because the product needs 16 + 32 = 48 bits for accurate representation. One way to do this is to cast the operands to 64 bits and then multiply, but that might be slower, and it's not supported on all machines. So instead it breaks o32 into two 16-bit pieces, then does two multiplications in 32-bits, and combines the results.
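On a machine with a fast 64-bit multiply, the same Q15 product can be written directly. This is my own sketch, not part of the original code; on typical implementations it gives the same result as MPY48SR (apart from the overflow corner case mentioned in the next answer):
#include <cstdint>

static inline int32_t mpy_q15(int16_t o16, int32_t o32)
{
    // Widen first so the full 48-bit product is kept, then round and rescale.
    return static_cast<int32_t>((static_cast<int64_t>(o16) * o32 + 0x4000) >> 15);
}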
This implements multiplication of fixed-point numbers. The numbers are viewed as being in the Q15 format (having 15 bits in the fractional part).
Mathematically, this function calculates (o16 * o32) / 2^15, rounded to the nearest integer (hence the 2^14 term, which represents 1/2, added to the number before the shift in order to round it). It uses unsigned and signed 16-bit multiplications with a 32-bit result, which are presumably supported by the instruction set.
Note that there exists a corner case, where each of the numbers has a minimal value (-2^15 and -2^31); in this case, the result (2^31) is not representable in the output, and gets wrapped over (becomes -2^31 instead). For all other combinations of o16 and o32, the result is correct.