How is float_max + 1 defined in C++?

float f = std::numeric_limits<float>::max() + 1.0f;
What is the actual value of f?
For unsigned integral types it is well defined to wrap around to 0, and for signed integral types it is undefined/implementation-specific, if I'm not wrong.
But how is it specified in standard for float/double? Is it std::numeric_limits<float>::max() or does it become std::numeric_limits<float>::infinity()?
On cppreference I didn't find a specification so far, maybe I missed it.
Thanks for help!

With an IEEE-754 single-precision float in the default rounding mode (round-to-nearest, ties-to-even), max + 1 will simply be max.
Note that the maximum positive finite 32-bit float is:
           3  2          1         0
           1 09876543 21098765432109876543210
           S ---E8--- ----------F23----------
   Binary: 0 11111110 11111111111111111111111
      Hex: 7F7F FFFF
Precision: SP
     Sign: Positive
 Exponent: 127 (Stored: 254, Bias: 127)
Hex-float: +0x1.fffffep127
    Value: +3.4028235e38 (NORMAL)
For this number to overflow and become infinity using the default rounding mode of round-nearest-ties-to-even, you have to add at least:
           3  2          1         0
           1 09876543 21098765432109876543210
           S ---E8--- ----------F23----------
   Binary: 0 11100110 00000000000000000000000
      Hex: 7300 0000
Precision: SP
     Sign: Positive
 Exponent: 103 (Stored: 230, Bias: 127)
Hex-float: +0x1p103
    Value: +1.0141205e31 (NORMAL)
Anything you add that is less than this particular value will be rounded back to the max value itself. Different rounding modes might give slightly different results, but the order of magnitude of the number you're looking for is about 1e31, which is pretty darn large.
This is an excellent example of how IEEE floats get sparser and sparser as their magnitude increases.
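A quick way to check this (a minimal sketch of mine, assuming IEEE-754 binary32 floats, the default round-to-nearest mode, and C++17 hex-float literals):
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    const float max = std::numeric_limits<float>::max();

    std::cout << std::boolalpha;
    // 1.0f is far below half an ULP of max (2^103), so the sum rounds back down.
    std::cout << (max + 1.0f == max) << "\n";          // true
    std::cout << std::isinf(max + 0x1p102f) << "\n";   // false: still rounds to max
    std::cout << std::isinf(max + 0x1p103f) << "\n";   // true: ties-to-even overflows to +inf
}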

Related

Is there a formula to find the numbers of bits for either exponent or significand in a floating point number?

Recently, I have been interested with using bit shiftings on floating point numbers to do some fast calculations.
To make them work in more generic ways, I would like to make my functions work with different floating point types, probably through templates, that is not limited to float and double, but also "halfwidth" or "quadruple width" floating point numbers and so on.
Then I noticed:
- Half --- 5 exponent bits --- 10 significand bits
- Float --- 8 exponent bits --- 23 significand bits
- Double --- 11 exponent bits --- 52 significand bits
So far I thought exponent bits = log2(total bytes) * 3 + 2,
which means a 128-bit float should have 14 exponent bits, and a 256-bit float should have 17 exponent bits.
However, then I learned:
- Quad --- 15 exponent bits --- 112 significand bits
- Octuple --- 19 exponent bits --- 237 significand bits
So, is there a formula to find it at all? Or, is there a way to call it through some builtin functions?
C or C++ are preferred, but open to other languages.
Thanks.
Characteristics Provided Via Built-In Functions
C++ provides this information via the std::numeric_limits template:
#include <iostream>
#include <limits>
#include <cmath>

template<typename T> void ShowCharacteristics()
{
    int radix = std::numeric_limits<T>::radix;
    std::cout << "The floating-point radix is " << radix << ".\n";
    std::cout << "There are " << std::numeric_limits<T>::digits
              << " base-" << radix << " digits in the significand.\n";
    int min = std::numeric_limits<T>::min_exponent;
    int max = std::numeric_limits<T>::max_exponent;
    std::cout << "Exponents range from " << min << " to " << max << ".\n";
    std::cout << "So there must be " << std::ceil(std::log2(max-min+1))
              << " bits in the exponent field.\n";
}

int main()
{
    ShowCharacteristics<double>();
}
Sample output:
The floating-point radix is 2.
There are 53 base-2 digits in the significand.
Exponents range from -1021 to 1024.
So there must be 11 bits in the exponent field.
C also provides the information, via macro definitions like DBL_MANT_DIG defined in <float.h>, but the standard defines the names only for types float (prefix FLT), double (DBL), and long double (LDBL), so the names in a C implementation that supported additional floating-point types would not be predictable.
Note that the exponent as specified in the C and C++ standards is one off from the usual exponent described in IEEE-754: It is adjusted for a significand scaled to [½, 1) instead of [1, 2), so it is one greater than the usual IEEE-754 exponent. (The example above shows the exponent ranges from −1021 to 1024, but the IEEE-754 exponent range is −1022 to 1023.)
Formulas
IEEE-754 does provide formulas for recommended field widths, but it does not require IEEE-754 implementations to conform to these, and of course the C and C++ standards do not require C and C++ implementations to conform to IEEE-754. The interchange format parameters are specified in IEEE 754-2008 3.6, and the binary parameters are:
For a floating-point format of 16, 32, 64, or 128 bits, the significand width (including leading bit) should be 11, 24, 53, or 113 bits, and the exponent field width should be 5, 8, 11, or 15 bits.
Otherwise, for a floating-point format of k bits, k should be a multiple of 32, the significand width should be k − round(4·log2(k)) + 13 bits, and the exponent field width should be round(4·log2(k)) − 13 bits.
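As a rough sketch of my own (not part of the standard's text), those recommendations can be evaluated directly; the function name exponent_bits is just illustrative:
#include <cmath>
#include <iostream>

// Recommended exponent field width for a k-bit binary interchange format,
// per IEEE 754-2008 3.6: fixed values for the four named formats, otherwise
// round(4*log2(k)) - 13 for k a multiple of 32.
int exponent_bits(int k)
{
    if (k == 16)  return 5;
    if (k == 32)  return 8;
    if (k == 64)  return 11;
    if (k == 128) return 15;
    return static_cast<int>(std::lround(4 * std::log2(k))) - 13;
}

int main()
{
    for (int k : {16, 32, 64, 128, 256, 512, 1024}) {
        int e = exponent_bits(k);
        // significand width (including the implicit leading bit) is k - e
        std::cout << k << "-bit: " << e << " exponent bits, "
                  << (k - e) << " significand bits\n";
    }
}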
The answer is no.
How many bits to use (or even which representation to use) is decided by compiler implementers and committees. And there's no way to guess what a committee decided (and no, it's not the "best" solution for any reasonable definition of "best"... it's just what happened that day in that room: an historical accident).
If you really want to get down to that level you need to actually test your code on the platforms you want to deploy to and add in some #ifdef macrology (or ask the user) to find which kind of system your code is running on.
Also beware that, in my experience, one area where compilers are extremely aggressive (to the point of being obnoxious) about type-aliasing optimizations is floating point numbers.
I want to see if there's a formula, so that if, say, a 512-bit float is added to the standard, the code would automatically work with it, without the need to alter anything.
I don't know of a published standard that guarantees the bit allocation for future formats (*). Past history shows that several considerations factor into the final choice; see for example the answer and links at Why do higher-precision floating point formats have so many exponent bits?. (*) EDIT: see note added at the end.
For a guessing game, the existing 5 binary formats defined by IEEE-754 hint that the number of exponent bits grows slightly faster than linear. One (random) formula that fits these 5 data points could be for example (in WA notation) exponent_bits = round( (log2(total_bits) - 1)^(3/2) ).
This would foresee that a hypothetical binary512 format would assign 23 bits to the exponent, though of course IEEE is not bound in any way by such second-guesses.
The above is just an interpolation formula that happens to match the 5 known exponents, and it is certainly not the only such formula. For example, searching for the sequence 5,8,11,15,19 on oeis finds 18 listed integer sequences that contain this as a subsequence.
[ EDIT ] As pointed out in @EricPostpischil's answer, IEEE 754-2008 does in fact list the formula exponent_bits = round( 4 * log2(total_bits) ) - 13 for total_bits >= 128 (the formula actually holds for total_bits = 64, too, though not for 32 or 16).
The empirical formula above matches the reference IEEE one for 128 <= total_bits <= 1472; in particular, IEEE also gives 23 exponent bits for binary512 and 27 exponent bits for binary1024.
UPDATE: I've now combined this into a single unified function that lines up exactly with the official formula, while also using the proper exponents for the 16- and 32-bit formats, and it reports how the bits are split between the sign bit, exponent bits, and mantissa bits.
Inputs can be a number of bits, e.g. 64, a ratio like "2x", or case-insensitive single letters:
- "S" for 1x (single), "D" for 2x (double),
- "Q" for 4x (quadruple), "O" for 8x (octuple),
- "X" for 16x (he"X"), "T" for 32x ("T"hirty-two);
- all other, missing, or invalid inputs default to 0.5x half-precision.
gcat <( jot 20 | mawk '$!(_=NF)=(_+_)^($_)' ) \
<( jot - -1 8 | mawk '$!NF =(++_+_)^$(_--)"x"' ) |
{m,g}awk '
function _754(__,_,___) {
return \
(__=(__==___)*(_+=_+=_^=_<_) ? _--^_++ : ">"<__ ? \
(_+_)*(_*_/(_+_))^index("SDQOXT", toupper(__)) : \
__==(+__ "") ? +__ : _*int(__+__)*_)<(_+_) \
\
? "_ERR_754_INVALID_INPUT_" \
: "IEEE-754-fp:" (___=__) "-bit:" (_^(_<_)) "_s:"(__=int(\
log((_^--_)^(_+(__=(log(__)/log(--_))-_*_)/_-_)*_^(\
-((++_-__)^(--_<__) ) ) )/log(_))) "_e:" (___-++__) "_m"
}
function round(__,_) {
return \
int((++_+_)^-_+__)
}
function _4xlog2(_) {
return (log(_)/log(_+=_^=_<_))*_*_
}
BEGIN { CONVFMT = OFMT = "%.250g"
}
( $++NF = _754(_=$!__) ) \
( $++NF = "official-expn:" \
+(_=round(_4xlog2(_=_*32^(_~"[0-9.]+[Xx]")))-13) < 11 ? "n/a" :_) |
column -s':' -t | column -t | lgp3 5
.
2 _ERR_754_INVALID_INPUT_ n/a
4 _ERR_754_INVALID_INPUT_ n/a
8 IEEE-754-fp 8-bit 1_s 2_e 5_m n/a
16 IEEE-754-fp 16-bit 1_s 5_e 10_m n/a
32 IEEE-754-fp 32-bit 1_s 8_e 23_m n/a
64 IEEE-754-fp 64-bit 1_s 11_e 52_m 11
128 IEEE-754-fp 128-bit 1_s 15_e 112_m 15
256 IEEE-754-fp 256-bit 1_s 19_e 236_m 19
512 IEEE-754-fp 512-bit 1_s 23_e 488_m 23
1024 IEEE-754-fp 1024-bit 1_s 27_e 996_m 27
2048 IEEE-754-fp 2048-bit 1_s 31_e 2016_m 31
4096 IEEE-754-fp 4096-bit 1_s 35_e 4060_m 35
8192 IEEE-754-fp 8192-bit 1_s 39_e 8152_m 39
16384 IEEE-754-fp 16384-bit 1_s 43_e 16340_m 43
32768 IEEE-754-fp 32768-bit 1_s 47_e 32720_m 47
65536 IEEE-754-fp 65536-bit 1_s 51_e 65484_m 51
131072 IEEE-754-fp 131072-bit 1_s 55_e 131016_m 55
262144 IEEE-754-fp 262144-bit 1_s 59_e 262084_m 59
524288 IEEE-754-fp 524288-bit 1_s 63_e 524224_m 63
1048576 IEEE-754-fp 1048576-bit 1_s 67_e 1048508_m 67
0.5x IEEE-754-fp 16-bit 1_s 5_e 10_m n/a
1x IEEE-754-fp 32-bit 1_s 8_e 23_m n/a
2x IEEE-754-fp 64-bit 1_s 11_e 52_m 11
4x IEEE-754-fp 128-bit 1_s 15_e 112_m 15
8x IEEE-754-fp 256-bit 1_s 19_e 236_m 19
16x IEEE-754-fp 512-bit 1_s 23_e 488_m 23
32x IEEE-754-fp 1024-bit 1_s 27_e 996_m 27
64x IEEE-754-fp 2048-bit 1_s 31_e 2016_m 31
128x IEEE-754-fp 4096-bit 1_s 35_e 4060_m 35
256x IEEE-754-fp 8192-bit 1_s 39_e 8152_m 39
===============================================
Similar to the concept mentioned above, here's an alternative formula (just re-arranging some terms) that will calculate the unsigned integer range of the exponent ([32,256,2048,32768,524288], corresponding to [5,8,11,15,19]-powers-of-2) without needing to call the round function:
uint_range = ( 64 ** ( 1 + (k=log2(bits)-4)/2) )
*
( 2 ** -( (3-k)**(2<k) ) )
(a) x ** y means x-to-y-power
(b) 2 < k is a boolean condition that should just return 0 or 1.
The formula is accurate from 16-bit to 256-bit, at least. Beyond that, it yields exponent sizes of
– 512-bit : 23
– 1024-bit : 27
– 2048-bit : 31
– 4096-bit : 35
(Beyond 256 bits this may be inaccurate; even a 27-bit-wide exponent allows exponents of roughly ±67 million, spanning over 40 million decimal digits once you raise 2 to that power.)
From there, getting the IEEE 754 exponent width is just a matter of taking log2(uint_range).
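As a quick sanity check (my own sketch, separate from the awk code above), the same re-arranged formula can be evaluated in C++:
#include <cmath>
#include <iostream>

int main()
{
    for (int bits : {16, 32, 64, 128, 256}) {
        double k = std::log2(bits) - 4;
        // uint_range = 64^(1 + k/2) * 2^-((3-k)^(2<k))
        double uint_range = std::pow(64.0, 1 + k / 2)
                          * std::pow(2.0, -std::pow(3 - k, k > 2 ? 1 : 0));
        std::cout << bits << "-bit: uint_range = " << uint_range
                  << ", exponent bits = " << std::log2(uint_range) << "\n";
    }
}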

Represent fp16 minimum number in hex format

I need to use the min value of float16 in my program, but I don't want to explicitly write it out in decimal format. I want to know how to represent it in hex format.
float FP16_MIN = 5.96e-8;
Based on the top answer I received, the hex code for fp16 min with denorm is 0001.
I want a function to do:
float min = fp16_min(0x1);
I found a similar function in line 185 of https://eigen.tuxfamily.org/dox/Half_8h_source.html, but I didn't understand the implementation.
For FP16, the minimum positive normal value is:
           1       0
           5 43210 9876543210
           S -E5-- ---F10----
   Binary: 0 00001 0000000000
      Hex: 0400
Precision: HP
     Sign: Positive
 Exponent: -14 (Stored: 1, Bias: 15)
Hex-float: +0x1p-14
    Value: +6.1035156e-5 (NORMAL)
The minimum positive subnormal value is:
           1       0
           5 43210 9876543210
           S -E5-- ---F10----
   Binary: 0 00000 0000000001
      Hex: 0001
Precision: HP
     Sign: Positive
 Exponent: -14 (Stored: 0, Bias: 14)
Hex-float: +0x1p-24
    Value: +5.9604645e-8 (DENORMAL)
You can write the former as 0x1p-14 and the latter as 0x1p-24 in your program.
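In C++17 you can spell these as hexadecimal floating literals directly (a small sketch; the constant names are just illustrative):
#include <iostream>

int main()
{
    // Smallest positive half-precision values, written as C++17 hex-float
    // literals but stored in ordinary floats, since a native 16-bit float
    // type is typically not available.
    constexpr float fp16_min_normal    = 0x1p-14f;
    constexpr float fp16_min_subnormal = 0x1p-24f;
    std::cout << fp16_min_normal << "\n";     // 6.10352e-05
    std::cout << fp16_min_subnormal << "\n";  // 5.96046e-08
}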
If you want to convert from the underlying hexadecimal representation, then a common trick is to use a union in C and a memcpy in C++. See this answer for details: How is 1 encoded in C/C++ as a float (assuming IEEE 754 single precision representation)?
Of course, to do this properly, you'd need an underlying 16-bit float type, which is typically not available. So, you'll have to first figure out what the corresponding hexadecimal would be in the 32-bit single-precision format. For 0x1p-24 that's easy to compute in single precision:
           3  2          1         0
           1 09876543 21098765432109876543210
           S ---E8--- ----------F23----------
   Binary: 0 01100111 00000000000000000000000
      Hex: 3380 0000
Precision: SP
     Sign: Positive
 Exponent: -24 (Stored: 103, Bias: 127)
Hex-float: +0x1p-24
    Value: +5.9604645e-8 (NORMAL)
So the corresponding representation as a single precision float would be 0x33800000. (This is not hard to see: the bias for 32-bit float is 127, so you'd just put 103 in the exponent to get -24. I trust you can do that easily yourself; if not ask away.)
Now you can write:
#include <cstdint>
#include <cstring>
#include <iostream>

int main(void) {
    uint32_t abc = 0x33800000;  // bit pattern of 0x1p-24 in single precision
    float i;
    std::memcpy(&i, &abc, 4);   // reinterpret the bits as a float
    std::cout << i << std::endl;
    return 0;
}
which prints:
5.96046e-08
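If you want something closer to the fp16_min(0x1) call from the question, here is a minimal sketch of a general helper that decodes any half-precision bit pattern into a float (the name half_bits_to_float is mine, and infinities/NaNs are not handled):
#include <cmath>
#include <cstdint>
#include <iostream>

// Decode an IEEE-754 half-precision bit pattern (sign:1, exponent:5, fraction:10).
float half_bits_to_float(uint16_t bits)
{
    uint32_t sign = (bits >> 15) & 0x1;
    uint32_t exp  = (bits >> 10) & 0x1F;
    uint32_t frac = bits & 0x3FF;

    float value;
    if (exp == 0)
        value = std::ldexp(static_cast<float>(frac), -24);         // subnormal: frac * 2^-24
    else
        value = std::ldexp(1.0f + frac / 1024.0f, int(exp) - 15);   // normal: (1 + frac/2^10) * 2^(exp-15)
    return sign ? -value : value;
}

int main()
{
    std::cout << half_bits_to_float(0x0001) << "\n";  // 5.96046e-08, smallest subnormal
    std::cout << half_bits_to_float(0x0400) << "\n";  // 6.10352e-05, smallest normal
}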

Why is absolute value of INT_MIN different from INT_MAX? [duplicate]

This question already has answers here:
What is “two's complement”?
(24 answers)
Closed 7 years ago.
I'm trying to understand why INT_MIN is equal to -2^31 and not -(2^31 - 1).
My understanding is that an int is 4 bytes = 32 bits. Of these 32 bits, I assume 1 bit is used for the +/- sign, leaving 31 bits for the actual value. As such, INT_MAX is equal to 2^31 - 1 = 2147483647. On the other hand, why is INT_MIN equal to -2^31 = -2147483648? Wouldn't this exceed the '4 bytes' allotted for int? Based on my logic, I would have expected INT_MIN to equal -(2^31 - 1) = -2147483647.
Most modern systems use two's complement to represent signed integer data types. In this representation, one bit pattern on the non-negative side is used up to represent zero, hence there is one fewer positive value than there are negative values. In fact, this is one of the prime advantages this system has over the sign-magnitude system, where zero has two representations, +0 and -0. Since zero has only one representation in two's complement, the bit pattern thus freed up is used to represent one more (negative) number.
Let's take a small data type, say 4 bits wide, to understand this better. The number of possible states with this toy integer type would be 2⁴ = 16 states. When using two's complement to represent signed numbers, we would have 8 negative and 7 positive numbers and zero; in sign-magnitude system, we'd get two zeros, 7 positive and 7 negative numbers.
Bin Dec
0000 = 0
0001 = 1
0010 = 2
0011 = 3
0100 = 4
0101 = 5
0110 = 6
0111 = 7
1000 = -8
1001 = -7
1010 = -6
1011 = -5
1100 = -4
1101 = -3
1110 = -2
1111 = -1
I think you are confused because you are imagining that sign-magnitude representation is used for signed numbers; although this is also allowed by the language standards, it is far less likely to be implemented, as the two's complement system is a significantly better representation.
As of C++20, only two's complement is allowed for signed integers; source.
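A small illustrative sketch that prints both the asymmetric int range and the 4-bit toy mapping from the table above:
#include <climits>
#include <iostream>

int main()
{
    // The asymmetry the question asks about, straight from <climits>:
    std::cout << "INT_MIN = " << INT_MIN << "\n";   // -2147483648
    std::cout << "INT_MAX = " << INT_MAX << "\n";   //  2147483647

    // The 4-bit toy example: patterns 0..7 keep their value, 8..15 map to -8..-1.
    for (int pattern = 0; pattern < 16; ++pattern) {
        int value = (pattern < 8) ? pattern : pattern - 16;  // two's-complement value of the nibble
        std::cout << pattern << " -> " << value << "\n";
    }
}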

How to represent a negative number with a fraction in 2's complement?

So I want to represent the number -12.5. 12.5 is:
001100.100
If I don't calculate the fraction then it's simple, -12 is:
110100
But what is -12.5? is it 110100.100? How can I calculate this negative fraction?
With decimal number systems, each number position (or column) represents (reading a number from right to left): units (which is 10^0), tens (i.e. 10^1), hundreds (i.e. 10^2), etc.
With unsigned binary numbers, the base is 2, thus each position becomes (again, reading from right to left): 1 (i.e. 2^0) ,2 (i.e. 2^1), 4 (i.e. 2^2), etc.
For example
2^2 (4), 2^1 (2), 2^0 (1).
In signed two's complement the most significant bit (MSB) becomes negative. Therefore it represents the number's sign: '1' for a negative number and '0' for a positive number.
For a three-bit number the columns would hold these values:
-4, 2, 1
0 0 1 => 1
1 0 0 => -4
1 0 1 => -4 + 1 = -3
The way bit values work in a fixed-point (fractional) system is unchanged. Column values follow the same pattern as before, the base (2) raised to a power, but with the power going negative:
2^2 (4), 2^1 (2), 2^0 (1) . 2^-1 (0.5), 2^-2 (0.25), 2^-3 (0.125)
-1 will always be 111.000
For -0.5, add 0.5 to it: 111.100
In your case 110100.10 is equal to -32+16+4+0.5 = -11.5. What you did was create -12 then add 0.5 rather than subtract 0.5.
What you actually want is -32+16+2+1+0.5 = -12.5 = 110011.1
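If you want to check this mechanically, one approach (a sketch of mine, using 9 bits with 3 fraction bits) is to scale by 2^3 and let the machine's own two's-complement integers do the work:
#include <bitset>
#include <cmath>
#include <iostream>

int main()
{
    const int frac_bits = 3;
    double value = -12.5;

    // Scale so the fraction becomes part of an ordinary integer: -12.5 * 8 = -100.
    long scaled = std::lround(value * (1 << frac_bits));

    // The low 9 bits of -100 in two's complement are 110011100, i.e. 110011.100 = -12.5.
    std::bitset<9> bits(static_cast<unsigned long long>(scaled));
    std::cout << bits << "\n";   // prints 110011100
}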
You can double the number again and again until it becomes a negative integer (or reaches a defined limit), and then set the binary point correspondingly.
-25 is 11100111, so -12.5 is 1110011.1
So, you want to represent -12.5 in 2's complement representation:
12.5: 01100.1
2's complement of 01100.1: 10011.1
Verify the answer by checking the weighted-code property of the 2's complement representation (the MSB weight is negative): we get -16 + 2 + 1 + 0.5 = -12.5.

Represent negative number with 2's complement technique?

I am using 2's complement to represent a negative number in binary form.
Case 1: number -5
According to the 2' complement technique:
Convert 5 to the binary form:
00000101, then flip the bits
11111010, then add 1
00000001
=> result: 11111011
To make sure this is correct, I re-calculate to decimal:
-128 + 64 + 32 + 16 + 8 + 2 + 1 = -5
Case 2: number -240
The same steps are taken:
11110000
00001111
00000001
00010000 => recalculating this I got 16, not -240
Am I misunderstanding something?
The problem is that you are trying to represent 240 with only 8 bits. The range of an 8 bit signed number is -128 to 127.
If you instead represent it with 9 bits, you'll see you get the correct answer:
011110000 (240)
100001111 (flip the bits)
+
000000001 (1)
=
100010000
=
-256 + 16 = -240
Did you forget that -240 cannot be represented with 8 bits when it is signed?
The lowest negative number you can express with 8 bits is -128, which is 10000000.
Using 2's complement:
128 = 10000000
(flip) = 01111111
(add 1) = 10000000
The lowest negative number you can express with N bits (with signed integers of course) is always -2^(N-1).
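A small sketch tying both answers together (it assumes the usual two's-complement wraparound when converting unsigned to signed, which C++20 guarantees):
#include <cstdint>
#include <iostream>

int main()
{
    // The "flip the bits, add 1" recipe works for -5 within 8 bits:
    uint8_t five     = 5;
    uint8_t neg_five = static_cast<uint8_t>(~five + 1);          // 0b11111011
    std::cout << +static_cast<int8_t>(neg_five) << "\n";         // -5

    // 240 does not fit in a signed 8-bit type, so the same recipe wraps around:
    uint8_t v240     = 240;
    uint8_t neg_240  = static_cast<uint8_t>(~v240 + 1);          // 0b00010000
    std::cout << +static_cast<int8_t>(neg_240) << "\n";          // 16, not -240

    // With 16 bits (or the 9 bits used above) there is room for -240:
    uint16_t w240 = 240;
    std::cout << static_cast<int16_t>(~w240 + 1) << "\n";        // -240
}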