Double to int conversion behind the scene? - c++

I am just curious to know what happens behind the scenes when a double is converted to an int, say int(5666.1). Is that going to be more expensive than a static_cast of a child class to its parent? Since the representations of int and double are fundamentally different, are temporaries going to be created during the process, and is that expensive too?

Any CPU with native floating point will have an instruction to convert floating-point to integer data. That operation can take from a few cycles to many. Usually there are separate CPU registers for FP and integers, so you also have to subsequently move the integer to an integer register before you can use it. That may be another operation, possibly expensive. See your processor manual.
PowerPC notably does not include an instruction to move an integer in an FP register to an integer register. There has to be a store from FP to memory and load to integer. You could therefore say that a temporary variable is created.
In the case of no hardware FP support, the number has to be decoded. IEEE FP format is:
sign | exponent + bias | mantissa
To convert, you have to do something like
#include <cstdint>
#include <cstring>

void overflow(); // placeholder: report that the value does not fit in 32 bits

std::int32_t float_to_int( float float_value ) {
    // Single-precision format values:
    int const mantissa_bits = 23; // 52 for double.
    int const exponent_bits = 8; // 11 for double.
    int const exponent_bias = 127; // 1023 for double.
    std::int32_t ieee;
    std::memcpy( & ieee, & float_value, sizeof (std::int32_t) );
    // Low 23 bits plus the implicit leading 1.
    std::int32_t mantissa = ( ieee & ( (1 << mantissa_bits) - 1 ) ) | ( 1 << mantissa_bits );
    // Unbiased exponent, adjusted so that the mantissa is read as an integer.
    int exponent = ( ( ieee >> mantissa_bits ) & ( (1 << exponent_bits) - 1 ) )
        - ( exponent_bias + mantissa_bits );
    if ( exponent <= -32 ) {
        mantissa = 0;
    } else if ( exponent < 0 ) {
        mantissa >>= - exponent;
    } else if ( exponent + mantissa_bits + 1 >= 32 ) {
        overflow();
    } else {
        mantissa <<= exponent;
    }
    if ( ieee < 0 ) mantissa = - mantissa;
    return mantissa;
}
I.e., a few bit unpacking instructions and a shift.

There's invariably a dedicated FPU instruction that gets the job done: cvttsd2si if the code generator uses the Intel SSE2 instruction set. That's fast, but not as fast as a static cast, which usually doesn't require any code at all.

The static_cast depends on the compiler's C++ code generation, but generally has no runtime cost, as any pointer adjustment is calculated at compile time from the types named in the cast.
When you convert a double to int on an x86 system that uses the x87 FPU, the compiler will generate a FIST/FISTP (store integer) instruction, and the FPU will do the conversion. The conversion can also be implemented in software, and is done this way on certain hardware, or if the program requires it. The GNU MPFR library is capable of doing double to int conversions, and will perform the same conversion on all hardware.
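To make the comparison in the question concrete, here is a minimal sketch. The types Base1, Base2 and Derived and the helpers upcast and to_int are made up for illustration; they are not from the question.

```cpp
// Hypothetical class hierarchy, used only to illustrate the upcast.
struct Base1 { int x = 1; };
struct Base2 { int y = 2; };
struct Derived : Base1, Base2 { int z = 3; };

// Upcast: at most a constant offset is added to the pointer. The offset is
// known at compile time, so no conversion instruction is needed.
Base2* upcast(Derived* d) { return static_cast<Base2*>(d); }

// double -> int: a real conversion instruction (or a software routine) runs.
// The fractional part is discarded, i.e. the value is truncated toward zero.
int to_int(double d) { return static_cast<int>(d); }
```

Both casts look alike in source, but only the second one has to change the representation of the value.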

Related

Conversion from int to float and back gives different results C++ [duplicate]

Consider the following code, which is an SSCCE of my actual problem:
#include <iostream>
int roundtrip(int x)
{
    return int(float(x));
}
int main()
{
    int a = 2147483583;
    int b = 2147483584;
    std::cout << a << " -> " << roundtrip(a) << '\n';
    std::cout << b << " -> " << roundtrip(b) << '\n';
}
The output on my computer (Xubuntu 12.04.3 LTS) is:
2147483583 -> 2147483520
2147483584 -> -2147483648
Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...
Hm, on ideone, the output is different:
2147483583 -> 2147483520
2147483584 -> 2147483647
Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?
Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.
The float value nearest to 2147483584 is 2^31 exactly. The conversion from integer to floating-point usually rounds to the nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating-point is implementation-defined; most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is to round to the nearest.
Then, while converting from the float representing 2^31 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttss2si (cvttsd2si for a double operand) typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.
You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.
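One way to stay clear of the undefined behavior is to test the range before casting. The following is a sketch only: the helper name checked_float_to_int is hypothetical, and the code assumes a 32-bit int.

```cpp
#include <climits>

// Hypothetical helper: converts f to int only when the truncated value fits.
// Returns false (and leaves out untouched) for NaN and out-of-range inputs.
bool checked_float_to_int(float f, int& out)
{
    static_assert(INT_MAX == 2147483647, "sketch assumes a 32-bit int");
    // -2^31 and 2^31 are both exactly representable as float, while INT_MAX
    // is not, hence the strict '<' on the upper side.
    if (!(f >= -2147483648.0f && f < 2147483648.0f))
        return false; // also rejects NaN: every comparison with NaN is false
    out = static_cast<int>(f);
    return true;
}
```

With this check, the value 2^31 produced by float(2147483584) is rejected instead of being passed to the overflowing conversion.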
Pascal's answer is OK, but it lacks details, so some users may not get it ;-). If you are interested in how it looks at a lower level (assuming a coprocessor and not software handles floating-point operations), read on.
In the 32 bits of a float (IEEE 754) you can store all integers within the [-2^24...2^24] range. Integers outside that range may also have an exact float representation, but not all of them do. The problem is that you have only 24 significant bits to play with in a float.
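The 24-bit limit is easy to observe directly. This is a small sketch with a made-up helper name, survives_float_roundtrip; it assumes float is IEEE 754 binary32 and that the cast really rounds to float precision (true on typical x86-64 builds).

```cpp
// Returns true when x survives int -> float -> integer unchanged.
// The result is widened to long long because float(x) can be 2^31,
// which does not fit in a 32-bit int.
bool survives_float_roundtrip(int x)
{
    float f = static_cast<float>(x);
    return static_cast<long long>(f) == x;
}
```

For example, 16777216 (2^24) and 16777218 survive, while 16777217 does not, because it would need 25 significant bits.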
Here is what conversion from int->float typically looks like at a low level:
fild dword ptr[your int]
fstp dword ptr[your float]
It is just a sequence of 2 coprocessor instructions. The first loads the 32-bit int onto the coprocessor's stack and converts it into an 80-bit wide float.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
(PROGRAMMING WITH THE X87 FPU):
When floating-point, integer, or packed BCD integer
values are loaded from memory into any of the x87 FPU data registers, the values are
automatically converted into double extended-precision floating-point format (if they
are not already in that format).
Since FPU registers are 80-bit wide floats, there is no issue with fild here, as a 32-bit int fits perfectly in the 64-bit significand of that floating-point format.
So far so good.
The second part, fstp, is a bit tricky and may be surprising. It is supposed to store an 80-bit floating-point value in a 32-bit float. Although it is all about integer values (in the question), the coprocessor may actually perform 'rounding'. Eh? How do you round an integer value, even if it is stored in floating-point format? ;-)
I'll explain it shortly. Let's first see what rounding modes the x87 provides (they are an incarnation of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes controlled by bits #10 and #11 of the FPU's control word:
00 - to nearest even - Rounded result is the closest to the infinitely precise result. If two
values are equally close, the result is the even value (that is, the
one with the least-significant bit of zero). Default
01 - toward -Inf
10 - toward +inf
11 - toward 0 (ie. truncate)
You can play with rounding modes using this simple code (although it may be done differently - showing low level here):
enum ROUNDING_MODE
{
    RM_TO_NEAREST = 0x00,
    RM_TOWARD_MINF = 0x01,
    RM_TOWARD_PINF = 0x02,
    RM_TOWARD_ZERO = 0x03 // TRUNCATE
};
void set_round_mode(enum ROUNDING_MODE rm)
{
    short csw;
    short tmp = rm;
    _asm
    {
        push ax
        fstcw [csw]
        mov ax, [csw]
        and ax, ~(3<<10)
        shl [tmp], 10
        or ax, tmp
        mov [csw], ax
        fldcw [csw]
        pop ax
    }
}
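For compilers without that inline assembly syntax, the standard <cfenv> header offers a similar switch. The sketch below is an assumption-laden illustration, not the method used in this answer: the helper name int_to_float_with_mode is made up, and whether a given compiler honors the dynamic rounding mode for int to float conversions depends on the target and on compiler options (GCC, for instance, has -frounding-math for code that changes the mode).

```cpp
#include <cfenv>

// Converts x to float under the given rounding mode (FE_TONEAREST,
// FE_TOWARDZERO, FE_UPWARD or FE_DOWNWARD), then restores the previous mode.
float int_to_float_with_mode(int x, int mode)
{
    const int saved = std::fegetround();
    std::fesetround(mode);
    // volatile keeps the compiler from folding the conversion at compile
    // time or moving it across the fesetround calls.
    volatile int in = x;
    volatile float out = static_cast<float>(in);
    std::fesetround(saved);
    return out;
}
```

On a typical x86-64 build, 67108871 converts to 67108872 under FE_TONEAREST and to 67108864 under FE_TOWARDZERO.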
Ok, nice, but how is that related to integer values? Patience... To understand why rounding modes are involved in int to float conversion, consider the most obvious way of converting int to float, truncation (not the default), which may look like this:
record sign
negate your int if less than zero
find position of leftmost 1
shift int to the right/left so that 1 found above is positioned on bit #23
record number of shifts during the process so that you can calculate exponent
And the code simulating this behavior may look like this:
#include <cstdint>
#include <cstring>

float int2float(int value)
{
    // exact for all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method uses truncation 'rounding mode' during conversion
    if (value == 0) return 0.0f; // an all-zero bit pattern is 0.0
    std::uint32_t sign = 0;
    std::uint32_t abs_value = static_cast<std::uint32_t>(value);
    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        // negate in unsigned arithmetic, so -2^31 needs no special case
        abs_value = 0U - abs_value;
    }
    // find the leading one and move it to bit #23
    // (the loop must also test bit 0, otherwise the value 1 is never shifted)
    int bit_num = 31;
    for (; bit_num >= 0; bit_num--)
    {
        if (abs_value & (1U << bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right: the low bits are chopped (truncation)
                abs_value >>= bit_num - 23;
            }
            else
            {
                // need to shift left
                abs_value <<= 23 - bit_num;
            }
            break;
        }
    }
    // exponent is biased by 127; move it to the right place
    std::uint32_t exp = static_cast<std::uint32_t>(bit_num + 127) << 23;
    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    std::uint32_t coeff = abs_value & ~(1U << 23);
    std::uint32_t ret = sign | exp | coeff;
    float result;
    // memcpy avoids the aliasing violation of *((float*)&ret)
    std::memcpy(&result, &ret, sizeof result);
    return result;
}
Now an example: truncation mode converts 2147483583 to 2147483520.
2147483583 = 01111111_11111111_11111111_10111111
During int->float conversion you must shift the leftmost 1 to bit #23. The leading 1 is now in bit #30. To place it in bit #23 you must shift right by 7 positions. In doing so you lose the 7 least significant bits (they will not fit in the 32-bit float format); you truncate/chop them. They were:
0111111 = 63
And 63 is what the original number lost:
2147483583 -> 2147483520 + 63
Truncating is easy but may not be what you want, and is not best for all cases. Consider the example below:
67108871 = 00000100_00000000_00000000_00000111
The above value cannot be represented exactly by a float, but check what truncation does to it. As before, we need to shift the leftmost 1 to bit #23. This requires the value to be shifted right by exactly 3 positions, losing the 3 LSBs (from now on I'll write numbers differently, showing where the implicit 24th bit of the float is and bracketing the 23 explicit bits of the significand):
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Truncation chops the 3 trailing bits, leaving us with 67108864 (67108864 + 7 from the 3 chopped bits = 67108871). Remember that although we shift, we compensate by adjusting the exponent (omitted here).
Is that good enough? Hey, 67108872 is perfectly representable by a 32-bit float and should be much better than 67108864, right? CORRECT, and this is where you might want to talk about rounding when converting int to a 32-bit float.
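The two truncation examples above can be reproduced with plain integer arithmetic. This is a sketch with a made-up helper name, truncate_to_24_bits; it only models which bits survive and does not build the float itself.

```cpp
#include <cstdint>

// Keeps only the 24 most significant bits of v (counting from its leading 1)
// and clears everything below them, i.e. rounds toward zero to a value that
// a float can hold exactly.
std::uint32_t truncate_to_24_bits(std::uint32_t v)
{
    if (v < (1u << 24)) return v;      // already fits in 24 bits
    int msb = 31;
    while (!((v >> msb) & 1u)) --msb;  // position of the leading 1
    const int dropped = msb - 23;      // number of low bits that do not fit
    return v & ~((1u << dropped) - 1u);
}
```

Here truncate_to_24_bits(2147483583) gives 2147483520 (7 bits dropped) and truncate_to_24_bits(67108871) gives 67108864 (3 bits dropped).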
Now let's see how default 'rounding to nearest even' mode works and what are its implications in OP's case. Consider the same example one more time.
67108871 = 00000100_00000000_00000000_00000111
As we know we need 3 right shifts to place leftmost 1 in bit #23:
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
The procedure of 'rounding to nearest even' involves finding the 2 numbers that bracket the input value 67108871 from below and above as closely as possible. Keep in mind that we still operate within the FPU on 80 bits, so although I show some bits being shifted out, they are still in the FPU register and are only removed by the rounding operation when the output value is stored.
The 2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:
from top:
00000000_1.[0000000_00000000_00000000] 111 * 2^26
+1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
and from below:
00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
Obviously 67108872 is much closer to 67108871 than 67108864 hence conversion from 32bit int value 67108871 gives 67108872 (in rounding to nearest even mode).
Now OP's numbers (still rounding to nearest even):
2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
bracket values:
top:
00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
+1
= 00000000_10.[0000000_00000000_00000000] * 2^30
= 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom:
00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
Keep in mind that the word 'even' in 'rounding to nearest even' matters only when the input value is exactly halfway between the bracket values. Only then does it decide which bracket value is selected. In the above case 'even' does not matter and we simply choose the nearer value, which is 2147483520.
OP's last case shows where the word 'even' matters:
2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
bracket values are the same as previously:
top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
There is no nearer value now (2147483648-2147483584=64=2147483584-2147483520), so we must rely on 'even' and select the top (even) value, 2147483648.
And here is OP's problem, which Pascal had briefly described. The FPU works only on signed values, and 2147483648 cannot be stored as a signed int because its maximum value is 2147483647, hence the issues.
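The bracketing procedure for 'round to nearest even' can be sketched in integer arithmetic as well. The helper name round_to_24_bits_nearest_even is hypothetical, and the result is returned as a 64-bit value because rounding up may produce 2^31 or even 2^32.

```cpp
#include <cstdint>

// Rounds v to the nearest value that has at most 24 significant bits,
// breaking exact ties toward the candidate whose last kept bit is 0.
std::uint64_t round_to_24_bits_nearest_even(std::uint32_t v)
{
    if (v < (1u << 24)) return v;              // exactly representable
    int msb = 31;
    while (!((v >> msb) & 1u)) --msb;          // position of the leading 1
    const int drop = msb - 23;                 // bits that do not fit (1..8)
    const std::uint64_t unit = 1ull << drop;   // distance between the brackets
    std::uint64_t low = v & ~(unit - 1);       // bottom bracket value
    const std::uint64_t rem = v & (unit - 1);  // the bits shifted out
    const std::uint64_t half = unit >> 1;
    const bool low_is_odd = ((low >> drop) & 1u) != 0;
    if (rem > half || (rem == half && low_is_odd))
        low += unit;                           // select the top bracket value
    return low;
}
```

This reproduces the three cases discussed above: 67108871 gives 67108872, 2147483583 gives 2147483520, and the tie at 2147483584 gives 2147483648.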
A simple proof (without documentation quotes) that the FPU treats every value as signed is to debug this:
unsigned int test = (1u << 31);
_asm
{
    fild [test]
}
Although it looks like the test value should be treated as unsigned, it will be loaded as -2^31, as there are no separate instructions for loading signed and unsigned values into the FPU. Likewise, you will not find instructions that store an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, regardless of how you declared it in your program.
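The same effect can be modeled without inline assembly. In this sketch the helper name load_as_signed is made up; it reinterprets the 32 bits as a signed integer before converting, which is how the paragraph above describes a 32-bit fild.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets the bits of an unsigned 32-bit value as a signed one and
// converts that to double, mimicking what a 32-bit fild does with the bits.
double load_as_signed(std::uint32_t bits)
{
    std::int32_t as_signed;
    std::memcpy(&as_signed, &bits, sizeof as_signed);
    return static_cast<double>(as_signed);
}
```

Calling load_as_signed(1u << 31) yields -2147483648.0 rather than +2147483648.0.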
Was long but hope someone will learn something out of it.

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element.
Is there an efficient way to implement this using AVX and AVX2 instructions only?
Currently I'm using a loop which extracts each element and applies the _lzcnt_u32 function.
Related: to bit-scan one large bitmap, see Count leading zeros in __m256i word which uses pmovmskb -> bitscan to find which byte to do a scalar bitscan on.
This question is about doing 8 separate lzcnts on 8 separate 32-bit elements when you're actually going to use all 8 results, not just select one.
float represents numbers in an exponential format, so int->FP conversion gives us the position of the highest set bit encoded in the exponent field.
We want int->float with magnitude rounded down (truncate the value towards 0), not the default rounding to nearest. That could round up and make 0x3FFFFFFF look like 0x40000000. If you're doing a lot of these conversions without doing any FP math, you could set the rounding mode in the MXCSR (see footnote 1) to truncation, then set it back when you're done.
Otherwise you can use v & ~(v>>8) to keep the 8 most-significant bits and zero some or all lower bits, including a potentially-set bit 8 below the MSB. That's enough to ensure all rounding modes never round up to the next power of two. It always keeps the 8 MSB because v>>8 shifts in 8 zeros, so inverted that's 8 ones. At lower bit positions, wherever the MSB is, 8 zeros are shifted past there from higher positions, so it will never clear the most significant bit of any integer. Depending on how set bits below the MSB line up, it might or might not clear more below the 8 most significant.
After conversion, we use an integer shift on the bit-pattern to bring the exponent (and sign bit) to the bottom and undo the bias with a saturating subtract. We use min to set the result to 32 if no bits were set in the original 32-bit input.
__m256i avx2_lzcnt_epi32 (__m256i v) {
    // prevent value from being rounded up to the next power of two
    v = _mm256_andnot_si256(_mm256_srli_epi32(v, 8), v); // keep 8 MSB
    v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // convert an integer to float
    v = _mm256_srli_epi32(v, 23); // shift down the exponent
    v = _mm256_subs_epu16(_mm256_set1_epi32(158), v); // undo bias
    v = _mm256_min_epi16(v, _mm256_set1_epi32(32)); // clamp at 32
    return v;
}
Footnote 1: fp->int conversion is available with truncation (cvtt), but int->fp conversion is only available with default rounding (subject to MXCSR).
AVX512F introduces rounding-mode overrides for 512-bit vectors, which would solve the problem: __m512 _mm512_cvt_roundepi32_ps( __m512i a, int r);. But all CPUs with AVX512F also support AVX512CD, so you could just use _mm512_lzcnt_epi32, and with AVX512VL, _mm256_lzcnt_epi32.
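The AVX2 routine above is easier to follow as a scalar model of a single 32-bit lane. This is a sketch, not part of the answer: the name lzcnt32_via_float is made up, and it assumes float is IEEE 754 binary32.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of one 32-bit lane of the AVX2 routine.
unsigned lzcnt32_via_float(std::uint32_t v)
{
    v &= ~(v >> 8);                      // keep the 8 MSBs: rounding cannot reach the next power of two
    std::int32_t s;
    std::memcpy(&s, &v, sizeof s);       // the hardware conversion treats the lane as signed
    const float f = static_cast<float>(s);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::uint32_t e = bits >> 23;  // sign bit and biased exponent, 0..511
    // saturating subtract: a set sign bit makes e >= 256, so the result is 0
    const std::uint32_t r = (e >= 158u) ? 0u : 158u - e;
    return r < 32u ? r : 32u;            // an input of 0 gives 158, clamped to 32
}
```

For instance, 0x3FFFFFFF is masked to 0x3FC00000 before the conversion, so the exponent stays at that of 2^29 and the result is 2 instead of 1.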
@aqrit's answer looks like a more-clever use of FP bithacks. My answer below is based on the first place I looked for a bithack, which was old and aimed at scalar, so it didn't try to avoid double (which is wider than int32 and thus a problem for SIMD).
It uses HW signed int->float conversion and saturating integer subtracts to handle the MSB being set (negative float), instead of stuffing bits into a mantissa for manual uint->double. If you can set MXCSR to round down across a lot of these _mm256_lzcnt_epi32, that's even more efficient.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float suggests stuffing integers into the mantissa of a large double, then subtracting to get the FPU hardware to produce a normalized double. (I think this bit of magic is doing uint32_t -> double, with the technique @Mysticial explains in How to efficiently perform double/int64 conversions with SSE/AVX?, which works for uint64_t up to 2^52-1.)
Then grab the exponent bits of the double and undo the bias.
I think integer log2 is the same thing as lzcnt, but there might be an off-by-1 at powers of 2.
The Stanford Graphics bithack page lists other branchless bithacks you could use that would probably still be better than 8x scalar lzcnt.
If you knew your numbers were always small-ish (like less than 2^23) you could maybe do this with float and avoid splitting and blending.
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit integer (with no padding bits) by storing the integer in the mantissa while the exponent is set to 2^52. From this newly minted double, 2^52 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal).
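The same trick can be written without the union and the word-order macro. This is a sketch under the assumption that double is IEEE 754 binary64 and that v is nonzero; the names ilog2_via_double and lzcnt_from_ilog2 are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// floor(log2(v)) for v != 0, using the "stuff v into the mantissa of 2^52" trick.
int ilog2_via_double(std::uint32_t v)
{
    // High word 0x43300000 is the sign/exponent of 2^52; v fills the low mantissa bits.
    std::uint64_t bits = (0x43300000ull << 32) | v;   // represents 2^52 + v
    double d;
    std::memcpy(&d, &bits, sizeof d);
    d -= 4503599627370496.0;                          // subtract 2^52, leaving v normalized
    std::memcpy(&bits, &d, sizeof bits);
    return static_cast<int>(bits >> 52) - 0x3FF;      // exponent field minus the bias
}

// For nonzero 32-bit inputs the leading-zero count is 31 - floor(log2(v)).
int lzcnt_from_ilog2(std::uint32_t v)
{
    return 31 - ilog2_via_double(v);
}
```

Since every 32-bit integer is exactly representable in a double, no rounding occurs here, so powers of two do not cause an off-by-one.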
To do this with AVX2, blend and shift+blend odd/even halves with set1_epi32(0x43300000) and _mm256_castps_pd to get a __m256d. And after subtracting, _mm256_castpd_si256 and shift / blend the low/high halves into place then mask to get the exponents.
Doing integer operations on FP bit-patterns is very efficient with AVX2, just 1 cycle of extra latency for a bypass delay when doing integer shifts on the output of an FP math instruction.
(TODO: write it with C++ intrinsics, edit welcome or someone else could just post it as an answer.)
I'm not sure if you can do anything with int -> double conversion and then reading the exponent field. Negative numbers have no leading zeros and positive numbers give an exponent that depends on the magnitude.
If you did want that, you'd go one 128-bit lane at a time, shuffling to feed xmm -> ymm packed int32_t -> packed double conversion.
The question is also tagged AVX, but AVX provides no 256-bit integer instructions, which means one needs to fall back to SSE on platforms that support AVX but not AVX2. I am showing an exhaustively tested, but a bit pedestrian version below. The basic idea here is as in the other answers, in that the count of leading zeros is determined by the floating-point normalization that occurs during integer to floating-point conversion. The exponent of the result has a one-to-one correspondence with the count of leading zeros, except that the result is wrong in the case of an argument of zero. Conceptually:
clz (a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
where float_as_uint32() is a re-interpreting cast and uint32_to_float_rz() is a conversion from unsigned integer to floating-point with truncation. A normal, rounding, conversion could bump up the conversion result to the next power of two, resulting in an incorrect count of leading zero bits.
SSE does not provide truncating integer to floating-point conversion as a single instruction, nor conversions from unsigned integers. This functionality needs to be emulated. The emulation does not need to be exact, as long as it does not change the magnitude of the conversion result. The truncation part is handled by the invert - right shift - andn technique from aqrit's answer. To use signed conversion, we cut the number in half before the conversion, then double and increment after the conversion:
float approximate_uint32_to_float_rz (uint32_t a)
{
    float r = (float)(int)((a >> 1) & ~(a >> 2));
    return r + r + 1.0f;
}
This approach is translated into SSE intrinsics in sse_clz() below.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"

/* compute count of leading zero bits using floating-point normalization.
   clz(a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
   The problematic part here is uint32_to_float_rz(). SSE does not offer
   conversion of unsigned integers, and no rounding modes in integer to
   floating-point conversion. Since all we need is an approximate version
   that preserves order of magnitude:
   float approximate_uint32_to_float_rz (uint32_t a)
   {
       float r = (float)(int)((a >> 1) & ~(a >> 2));
       return r + r + 1.0f;
   }
*/
__m128i sse_clz (__m128i a)
{
    __m128 fp1 = _mm_set_ps1 (1.0f);
    __m128i zero = _mm_set1_epi32 (0);
    __m128i i158 = _mm_set1_epi32 (158);
    __m128i iszero = _mm_cmpeq_epi32 (a, zero);
    __m128i lsr1 = _mm_srli_epi32 (a, 1);
    __m128i lsr2 = _mm_srli_epi32 (a, 2);
    __m128i atrunc = _mm_andnot_si128 (lsr2, lsr1);
    __m128 atruncf = _mm_cvtepi32_ps (atrunc);
    __m128 atruncf2 = _mm_add_ps (atruncf, atruncf);
    __m128 conv = _mm_add_ps (atruncf2, fp1);
    __m128i convi = _mm_castps_si128 (conv);
    __m128i lsr23 = _mm_srli_epi32 (convi, 23);
    __m128i res = _mm_sub_epi32 (i158, lsr23);
    return _mm_sub_epi32 (res, iszero);
}

/* Portable reference implementation of 32-bit count of leading zeros */
int clz32 (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>= 8; r -= 8; }
    if (a >= 0x00000010) { a >>= 4; r -= 4; }
    if (a >= 0x00000004) { a >>= 2; r -= 2; }
    r -= a - (a & (a >> 1));
    return r;
}

/* Test floating-point based count leading zeros exhaustively */
int main (void)
{
    __m128i res;
    uint32_t resi[4], refi[4];
    uint32_t count = 0;
    do {
        refi[0] = clz32 (count);
        refi[1] = clz32 (count + 1);
        refi[2] = clz32 (count + 2);
        refi[3] = clz32 (count + 3);
        res = sse_clz (_mm_set_epi32 (count + 3, count + 2, count + 1, count));
        memcpy (resi, &res, sizeof resi);
        if ((resi[0] != refi[0]) || (resi[1] != refi[1]) ||
            (resi[2] != refi[2]) || (resi[3] != refi[3])) {
            printf ("error # %08x %08x %08x %08x\n",
                    count, count+1, count+2, count+3);
            return EXIT_FAILURE;
        }
        count += 4;
    } while (count);
    return EXIT_SUCCESS;
}

Conversion between 4-byte IBM floating-point and IEEE [duplicate]

I need to read values from a binary file. The data format is IBM single Precision Floating Point (4-byte Hexadecimal Exponent Data). I have C++ code that reads from the file and takes out each byte and stores it like so
unsigned char buf[BUF_LEN];
for (long position = 0; position < fileLength; position += BUF_LEN) {
    file.read((char* )(&buf[0]), BUF_LEN);
    // printf("\n%8ld: ", position);
    for (int byte = 0; byte < BUF_LEN; byte++) {
        // printf(" 0x%-2x", buf[byte]);
    }
}
This prints out the hexadecimal values of each byte.
[Figure: IBM single precision floating point format]
How do I convert the buffer into floating point values?
The format is actually quite simple, and not particularly different than IEEE 754 binary32 format (it's actually simpler, not supporting any of the "magic" NaN/Inf values, and having no subnormal numbers, because the mantissa here has an implicit 0 on the left instead of an implicit 1).
As Wikipedia puts it,
The number is represented as the following formula: (−1)^sign × 0.significand × 16^(exponent−64).
If we imagine that the bytes you read are in a uint8_t b[4], then the resulting value should be something like:
uint32_t mantissa = (b[1]<<16) | (b[2]<<8) | b[3];
int exponent = (b[0] & 127) - 64;
double ret = mantissa * exp2(-24 + 4*exponent);
if(b[0] & 128) ret *= -1.;
Notice that here I calculated the result in a double, as the range of an IEEE 754 float is not enough to represent the same-sized IBM single precision value (the opposite also holds). Also, keep in mind that, due to endian issues, you may have to reverse the indexes in my code above.
Edit: @Eric Postpischil correctly points out that, if you have C99 or POSIX 2001 available, instead of mantissa * exp2(-24 + 4*exponent) you should use ldexp(mantissa, -24 + 4*exponent), which should be more precise (and possibly faster) across implementations.
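Put together as a function, the conversion could look like the following sketch. The name ibm32_to_double is made up, and the code assumes the four bytes are in the order they appear in the file, with the sign/exponent byte first.

```cpp
#include <cmath>
#include <cstdint>

// Converts one 4-byte IBM hexadecimal floating-point value to double.
// b[0] holds the sign bit and the 7-bit base-16 exponent (bias 64);
// b[1..3] hold the 24-bit fraction.
double ibm32_to_double(const std::uint8_t b[4])
{
    const std::uint32_t mantissa = (static_cast<std::uint32_t>(b[1]) << 16)
                                 | (static_cast<std::uint32_t>(b[2]) << 8)
                                 |  static_cast<std::uint32_t>(b[3]);
    const int exponent = (b[0] & 0x7F) - 64;
    // The fraction is mantissa / 2^24, and each exponent step is a factor of 16 = 2^4.
    const double magnitude = std::ldexp(static_cast<double>(mantissa), 4 * exponent - 24);
    return (b[0] & 0x80) ? -magnitude : magnitude;
}
```

As a check, the bytes 41 10 00 00 decode to 1.0 and C2 76 A0 00 decode to -118.625.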


sign changes when going from int to float and back

Consider the following code, which is an SSCCE of my actual problem:
#include <iostream>
int roundtrip(int x)
{
    return int(float(x));
}
int main()
{
    int a = 2147483583;
    int b = 2147483584;
    std::cout << a << " -> " << roundtrip(a) << '\n';
    std::cout << b << " -> " << roundtrip(b) << '\n';
}
The output on my computer (Xubuntu 12.04.3 LTS) is:
2147483583 -> 2147483520
2147483584 -> -2147483648
Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...
Hm, on ideone, the output is different:
2147483583 -> 2147483520
2147483584 -> 2147483647
Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?
Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.
The float value nearest to 2147483584 is 2^31 exactly. The conversion from integer to floating-point usually rounds to the nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating-point is implementation-defined; most implementations define rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is to round to the nearest.
Then, while converting from the float representing 2^31 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttss2si (cvttsd2si for a double operand) typically generated by compilers happens to always return INT_MIN in case of overflow, regardless of whether the float is positive or negative.
You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.
Pascal's answer is OK, but it lacks details, so some users may not get it ;-). If you are interested in how it looks at a lower level (assuming a coprocessor and not software handles floating-point operations), read on.
In the 32 bits of a float (IEEE 754) you can store all integers within the [-2^24...2^24] range. Integers outside that range may also have an exact float representation, but not all of them do. The problem is that you have only 24 significant bits to play with in a float.
Here is what conversion from int->float typically looks like at a low level:
fild dword ptr[your int]
fstp dword ptr[your float]
It is just a sequence of 2 coprocessor instructions. The first loads the 32-bit int onto the coprocessor's stack and converts it into an 80-bit wide float.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
(PROGRAMMING WITH THE X87 FPU):
When floating-point, integer, or packed BCD integer
values are loaded from memory into any of the x87 FPU data registers, the values are
automatically converted into double extended-precision floating-point format (if they
are not already in that format).
Since FPU registers are 80-bit wide floats, there is no issue with fild here, as a 32-bit int fits perfectly in the 64-bit significand of that floating-point format.
So far so good.
The second part, fstp, is a bit tricky and may be surprising. It is supposed to store an 80-bit floating-point value in a 32-bit float. Although it is all about integer values (in the question), the coprocessor may actually perform 'rounding'. Eh? How do you round an integer value, even if it is stored in floating-point format? ;-)
I'll explain it shortly. Let's first see what rounding modes the x87 provides (they are an incarnation of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes controlled by bits #10 and #11 of the FPU's control word:
00 - to nearest even (default) - the rounded result is the closest to the infinitely precise result; if two values are equally close, the result is the even value (that is, the one with a least-significant bit of zero)
01 - toward -Inf
10 - toward +Inf
11 - toward 0 (i.e. truncate)
You can play with rounding modes using this simple code (although it may be done differently - showing low level here):
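// MSVC-style inline assembly; builds only for 32-bit x86 targets
enum ROUNDING_MODE
{
    RM_TO_NEAREST  = 0x00,
    RM_TOWARD_MINF = 0x01,
    RM_TOWARD_PINF = 0x02,
    RM_TOWARD_ZERO = 0x03 // TRUNCATE
};
void set_round_mode(enum ROUNDING_MODE rm)
{
    short csw;
    short tmp = rm;
    _asm
    {
        push ax
        fstcw [csw]         ; store the current control word
        mov ax, [csw]
        and ax, ~(3<<10)    ; clear the rounding-control bits #10 and #11
        shl [tmp], 10       ; move the requested mode into those bit positions
        or ax, tmp
        mov [csw], ax
        fldcw [csw]         ; load the updated control word
        pop ax
    }
}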
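For compilers without this kind of inline assembly, a similar experiment can be sketched with the standard <cfenv> functions std::fegetround and std::fesetround. The helper name int_to_float_with_mode is made up here, and this is an assumption-laden sketch: it relies on the platform defining all four FE_* macros and on the generated conversion instruction honoring the current rounding mode (which is what typically happens on x86, but the C++ standard leaves room for compilers to ignore dynamic rounding modes).

```cpp
#include <cfenv>

// Hypothetical helper: converts value to float under the given <cfenv>
// rounding mode (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD or FE_TOWARDZERO)
// and then restores the mode that was active before.
float int_to_float_with_mode(int value, int mode)
{
    const int saved = std::fegetround();
    std::fesetround(mode);
    // volatile keeps the compiler from folding the conversion at compile
    // time or moving it across the fesetround calls.
    volatile int input = value;
    volatile float result = static_cast<float>(input);
    std::fesetround(saved);
    return result;
}
```

On a typical x86 build, int_to_float_with_mode(67108871, FE_TOWARDZERO) should give 67108864 and the same call with FE_TONEAREST should give 67108872.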
OK, nice, but how is that related to integer values? Patience ... to understand why rounding modes might be involved in an int to float conversion, look at the most obvious way of converting int to float - truncation (not the default) - which may look like this:
record sign
negate your int if less than zero
find position of leftmost 1
shift int to the right/left so that 1 found above is positioned on bit #23
record number of shifts during the process so that you can calculate exponent
And the code simulating this behavior may look like this:
// needs <cstring> for std::memcpy; assumes 32-bit int and IEEE 754 float
float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion
    float result;
    // zero has no leading 1 to look for - return it directly
    if (value == 0) return 0.0f;
    if (value == (1U<<31)) // ie -2^31
    {
        // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
        unsigned int min_bits = 0xCF000000; // bit pattern of -2^31 as float
        // memcpy instead of a pointer cast avoids breaking strict aliasing
        std::memcpy(&result, &min_bits, sizeof result);
        return result;
    }
    unsigned int sign = 0;
    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }
    // right shift of a negative signed value is implementation-defined
    // (usually an arithmetic shift that copies the sign into the MSB),
    // hence using unsigned abs_value_copy for shift
    unsigned int abs_value_copy = value;
    // find leading one
    int bit_num = 31;
    int shift_count = 0;
    // >= 0 so that bit #0 is checked too (otherwise value 1 is never shifted)
    for(; bit_num >= 0; bit_num--)
    {
        if (abs_value_copy & (1U<<bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }
    // exponent is biased by 127
    unsigned int exp = bit_num + 127;
    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    unsigned int coeff = abs_value_copy & ~(1U<<23);
    // move exp to the right place
    exp <<= 23;
    unsigned int ret = sign | exp | coeff;
    std::memcpy(&result, &ret, sizeof result);
    return result;
}
Now example - truncation mode converts 2147483583 to 2147483520.
2147483583 = 01111111_11111111_11111111_10111111
During int->float conversion you must shift the leftmost 1 to bit #23. Here the leading 1 is in bit #30, so to place it in bit #23 you must shift right by 7 positions. In doing so you lose the 7 least-significant bits (they do not fit in the 32-bit float format), i.e. you truncate/chop them. They were:
0111111 = 63
And 63 is what the original number lost:
2147483583 -> 2147483520 + 63
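The chopping can be reproduced with plain integer arithmetic. The sketch below is only an illustration of the truncation step (the name keep_top_24_bits is made up); it works on the integer itself instead of building the float bit pattern.

```cpp
#include <cstdint>

// Hypothetical helper: keeps the 24 most significant bits of value
// (counted from its leading 1) and clears everything below them.
std::uint32_t keep_top_24_bits(std::uint32_t value)
{
    int leading = 31;
    while (leading > 0 && !(value & (1u << leading)))
        --leading;                     // position of the leftmost 1
    if (leading <= 23)
        return value;                  // fits in 24 bits, nothing is lost
    const int dropped = leading - 23;  // number of low bits that do not fit
    return value & ~((1u << dropped) - 1u);
}
```

For 2147483583 the leading 1 is in bit #30, so 7 bits are cleared: the result is 2147483520 and the difference is the lost 63.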
Truncating is easy, but it is not necessarily what you want, nor the best choice in all cases. Consider the example below:
67108871 = 00000100_00000000_00000000_00000111
The above value cannot be exactly represented by a float, but check what truncation does to it. As previously, we need to shift the leftmost 1 to bit #23. This requires the value to be shifted right by exactly 3 positions, losing the 3 LSB bits (from now on I'll write numbers differently, showing where the implicit 24th bit of the float is and bracketing the 23 explicit bits of the significand):
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
Truncation chops the 3 trailing bits, leaving us with 67108864 (67108864 + 7 (the 3 chopped bits) = 67108871). Remember that although we shift, we compensate by adjusting the exponent (omitted here).
Is that good enough? Hey, 67108872 is perfectly representable by a 32-bit float and should be much better than 67108864, right? CORRECT, and this is where you might want to talk about rounding when converting an int to a 32-bit float.
Now let's see how the default 'rounding to nearest even' mode works and what its implications are in the OP's case. Consider the same example one more time.
67108871 = 00000100_00000000_00000000_00000111
As we know we need 3 right shifts to place leftmost 1 in bit #23:
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
The procedure of 'rounding to nearest even' involves finding the 2 representable numbers that bracket the input value 67108871 from below and from above as closely as possible. Keep in mind that within the FPU we still operate on 80 bits, so although I show some bits as shifted out, they are still in the FPU register; they are only removed by the rounding operation when the output value is stored.
00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)
2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:
from top:
00000000_1.[0000000_00000000_00000000] 111 * 2^26
+1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872
and from below:
00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864
Obviously 67108872 is much closer to 67108871 than 67108864 hence conversion from 32bit int value 67108871 gives 67108872 (in rounding to nearest even mode).
Now OP's numbers (still rounding to nearest even):
2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
bracket values:
top:
00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
+1
= 00000000_10.[0000000_00000000_00000000] * 2^30
= 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom:
00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
Keep in mind that the word 'even' in 'rounding to nearest even' matters only when the input value is exactly halfway between the bracket values. Only then does it 'decide' which bracket value is selected. In the above case 'even' does not matter and we must simply choose the nearer value, which is 2147483520.
The OP's last case shows where the word 'even' matters:
2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30
bracket values are the same as previously:
top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648
bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520
There is no nearer value now (2147483648-2147483584=64=2147483584-2147483520) so we must rely on even and select top (even) value 2147483648.
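The bracket-and-pick procedure can be sketched with integer arithmetic only. This is an illustration of the rounding rule, not of what the FPU does internally; the name round_to_24_bits_nearest_even is made up for this example.

```cpp
#include <cstdint>

// Hypothetical helper: rounds value to 24 significant bits using
// round-to-nearest, ties-to-even. Returns 64 bits because the upper
// bracket can reach 2^32, which does not fit in 32 bits.
std::uint64_t round_to_24_bits_nearest_even(std::uint32_t value)
{
    int leading = 31;
    while (leading > 0 && !(value & (1u << leading)))
        --leading;                                   // position of the leftmost 1
    if (leading <= 23)
        return value;                                // already exact
    const int dropped = leading - 23;                // bits that do not fit
    const std::uint32_t unit = 1u << dropped;        // weight of the last kept bit
    const std::uint32_t half = unit >> 1;
    const std::uint32_t remainder = value & (unit - 1u);
    const std::uint64_t bottom = value - remainder;  // lower bracket value
    const std::uint64_t top = bottom + unit;         // upper bracket value
    if (remainder < half) return bottom;             // bottom is nearer
    if (remainder > half) return top;                // top is nearer
    // Exact tie: pick the bracket whose last kept bit is 0 (the 'even' one).
    return (bottom & unit) ? top : bottom;
}
```

It reproduces the three cases above: 67108871 gives 67108872, 2147483583 gives 2147483520, and the tie 2147483584 gives 2147483648.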
And here is the OP's problem, which Pascal had briefly described: the FPU works only on signed values, and 2147483648 cannot be stored as a signed int, whose maximum value is 2147483647, hence the issues.
A simple proof (without documentation quotes) that the FPU works only on signed values, i.e. treats every value as signed, is to debug this:
unsigned int test = (1u << 31);
_asm
{
fild [test]
}
Although it looks like the test value should be treated as unsigned, it will be loaded as -2^31, as there are no separate instructions for loading signed and unsigned values into the FPU. Likewise, you'll not find instructions that allow you to store an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, regardless of how you might have declared it in your program.
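If you do not have a debugger or inline assembly at hand, the effect can be imitated in portable code. The sketch below (with the made-up name load_as_signed) only mimics the way a 32-bit fild interprets its operand; it does not execute fild itself.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the 32 bits as a signed two's complement
// integer, the way a 32-bit fild does, and widens the result to double
// (which holds every 32-bit integer exactly).
double load_as_signed(std::uint32_t bits)
{
    std::int32_t as_signed;
    std::memcpy(&as_signed, &bits, sizeof as_signed); // same bit pattern, signed type
    return static_cast<double>(as_signed);
}
```

Passing 1u << 31 yields -2147483648.0, even though the argument was declared unsigned.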
Was long but hope someone will learn something out of it.