C++17 Hexadecimal floating point literal single precision suffix conflict? - c++

I was looking at the C++17 spec for floating point literals and found a problem. How do you tell the difference between the digit F and the suffix F for single precision?
For instance, does the literal 0x1p0F translate to a double precision 32768.0 or a single precision 1.0F?
The spec says that the suffix is optional, and that no suffix indicates double precision, so, as written, there is a definite ambiguity.

A hex-float literal must use a p exponent. The exponent is defined using non-hexadecimal digits (a decimal integer that represents the exponent to be applied to 2). Therefore, it cannot contain "A-F" characters. So there is no ambiguity. 0x1p0F has an exponent of "0", and is of type float.
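A small sketch to check this with a compiler, assuming C++17 or later (hex-float literals are not standard C++ before that). The constant names are made up for illustration.

```cpp
#include <type_traits>

// The exponent after 'p' is decimal, so a trailing F can only be the suffix.
static_assert(std::is_same_v<decltype(0x1p0F), float>, "F suffix gives float");
static_assert(std::is_same_v<decltype(0x1p0), double>, "no suffix gives double");
static_assert(std::is_same_v<decltype(0x1p0L), long double>, "L suffix gives long double");

// Before the 'p', F is an ordinary hex digit: 0xFp0 is 15 * 2^0.
constexpr double fifteen = 0xFp0;

// To really get 32768 the exponent has to be written in decimal: 1 * 2^15.
constexpr float scaled = 0x1p15F;
```

If the static_asserts compile, the literal types are as described in the answer.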

Related

Formatting a parsable float literal

I need to output float numbers using printf formatting, in such a way that the generated strings are valid float literals. As the numbers are arbitrary, I use the %g format descriptor, followed by an f.
This raises a problem in one specific case: if the mantissa is an integer and no exponent is appended. In this case, the number is printed without a decimal point or exponent, resulting in an illegal constant.
E.g.
3.3                  3.3f      ok
3000000000000000.    3e+015f   ok
0.000000000000003    3e-015f   ok
3                    3f        nok
Do you see an easy way to handle this corner case?
"if the mantissa is an integer and no exponent is appended. In this case, the number is printed without a decimal point or exponent"
So use the alternative form, %#g. It always prints a decimal point, but it also stops trailing zeros from being removed, so 3 is printed as 3.00000.
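A minimal sketch of that suggestion, wrapped in C++ around snprintf. The helper name to_float_literal is made up for illustration, and only finite values are handled (inf and nan would not give a valid literal).

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Hypothetical helper: formats x as text that parses back as a float literal.
std::string to_float_literal(float x) {
    assert(std::isfinite(x));  // inf/nan are outside the scope of this sketch
    char buf[64];
    // '#' forces a decimal point; as a side effect trailing zeros are kept.
    std::snprintf(buf, sizeof buf, "%#gf", static_cast<double>(x));
    return buf;
}
```

With this, 3.0f comes out as 3.00000f and 3.3f as 3.30000f, both of which are legal float constants.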

strtof() function misplacing decimal place

I have a string "1613894376.500012077" and I want to use strtof in order to convert to floating point 1613894376.500012077. The problem is when I use strtof I get the following result with the decimal misplaced 1.61389e+09. Please help me determine how to use strtof properly.
A typical float is 32-bit and can only represent exactly about 2^32 different values. "1613894376.500012077" is not one of those.
"1.61389e+09" is the same value as "1613890000.0" and represents a close value that float can represent.
The 2 closest floats are:
1613894272.0
1613894400.0 // slightly closer to 1613894376.500012077
Print with more precision to see more digits.
The decimal point is not misplaced. The notation “1.61389e+09” means 1.61389 × 10^9, which is 1,613,890,000., which has the decimal point in the correct place.
The actual result of strtof in your computer is probably 1,613,894,400. This is the closest value to 1613894376.500012077 that the IEEE-754 binary32 (“single”) format can represent, and that is the format commonly used for float. When you print it with %g, the default is to use just six significant digits. To see it with more precision, print it with %.999g.
The number 1613894376.500012077 is equivalent to 1.61389e+09 (the same number up to the precision of the machine and the digits printed). The e+09 suffix means that the decimal point belongs nine decimal digits to the right of where it is written (in other words, the number is multiplied by 10 to the ninth power). This is a common notation in computer science called scientific notation.
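To see the value strtof really returns, print it with a format that does not cut off at six significant digits. A short sketch, assuming float is IEEE-754 binary32 and that strtof rounds to nearest; the helper name parse_and_print is made up for illustration.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses text with strtof and returns the result
// printed with all of its integer digits.
std::string parse_and_print(const char* text) {
    float value = std::strtof(text, nullptr);
    char buf[64];
    // %.1f prints every digit before the decimal point, unlike the default %g.
    std::snprintf(buf, sizeof buf, "%.1f", static_cast<double>(value));
    return buf;
}
```

Under those assumptions, parse_and_print("1613894376.500012077") gives 1613894400.0, the nearer of the two neighbouring floats listed above.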

Maximum Decimal Exponent IEEE 754

The Wikipedia page on the IEEE 754 standard contains a table that summarizes different floating point representation formats. Here is an excerpt.
The meaning of the Decimal digits column is the number of digits the mantissa of the represented number has if you convert it to decimal. The page states that it is computed by (Significand bits)*log_10(2). I can see how that makes sense.
However, I don't see what the meaning is of the Decimal E max column. It is computed by (E max)*log_10(2) and is supposed to be "the maximum exponent in decimal". But isn't E max the maximum exponent in decimal?
I'm asking because these 'decimal' values are the values (I think) that can be passed to selected_real_kind in Fortran. If you define a real with kind selected_real_kind(6, 37) it will be single precision. There will be (at least) 6 significand digits in your decimal number. So a similar question is, what is the meaning of the 37? This is also the value returned by Fortran's range. The GNU Fortran docs state that "RANGE(X) returns the decimal exponent range in the model of the type of X", but it doesn't help me understand what it means.
I always come up with an answer myself minutes after I've posted it on StackExchange even though I've been thinking about it all day...
The number in binary is represented by m*2^(e) with m the mantissa and e the exponent in binary. The maximum value of e for single precision is 127.
The number converted to decimal can be represented by m*10^(e) with m the mantissa and e the exponent in decimal. To have the same (single) precision here, e has a maximum value of 127*log_10(2) = 38.23. You can also see this by noticing m*10^(127*log_10(2)) = m*2^(127).
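The arithmetic can be checked in a few lines. This is only a sketch: the 127 is the single precision E max from the text, and the function names are made up. std::numeric_limits<float>::max_exponent10 is the standard library's own "largest decimal exponent" for float.

```cpp
#include <cmath>
#include <limits>

// Largest binary exponent of IEEE single precision, as used in the text.
constexpr int kEmaxBinary = 127;

// 2^127 == 10^(127 * log10(2)), so this is the matching decimal exponent.
double decimal_emax() {
    return kEmaxBinary * std::log10(2.0);
}

// What the C++ standard library reports as the largest n with 10^n finite in float.
int float_max_exponent10() {
    return std::numeric_limits<float>::max_exponent10;
}
```

decimal_emax() is about 38.23, and rounding it down gives the same 38 that float_max_exponent10() reports.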

C++ gcc large number errors

My C++ project needs to work with numbers of planet masses... up to over 24 digits. They are floats. The same variable may also be a relatively small number (100). I have tried using double and long, but compiling on Linux with g++ I am receiving the warning:
warning: integer constant is too large for its type [enabled by default]
Also my calculations do not work because of this. I am wondering what type variable this kind of number will require.
I have done research, but it's turned up dry.. still, my apologies if this question is frequent. Thank you!
If you have a piece of code like:
double mass = 31415926535892718281828459;
then you need to understand that the constant is an integer. The whole statement would turn it into a double before putting it into mass but your scheme is failing before that point.
You need to tell the compiler it's a double straight away with something like:
double mass = 31415926535892718281828459.0;
Section 2.14 of C++11 details the literals and how they're defined. A group of digits, where the first isn't 0, is captured by the following rule of section 2.14.2 Integer literals:
decimal-literal:
nonzero-digit
decimal-literal digit
(a group of digits starting with 0 is still an integer, just one made out of octal digits rather than decimal ones).
Section 2.14.4 Floating literals shows how to tell the compiler that you want a double, for example by:
including a fractional component as in 1.414 or 15.; or
using the exponent notation as in 12e2.
Or, for the language lawyers out there:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E) and the exponent (not both) can be omitted.
The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.
You need to make sure it is double:
1234567890123456789012345 // integer literal, too large for any integer type: gives the warning
1234567890123456789012345.0 // double (floating point)
If you need extra precision, you should consider using a large number library. See also C++ library for big float numbers
Here's a simple test case that produces this warning:
float foo() {
    return 1000000000000000000000000;
}
The problem is that the number written there is actually an integer literal. This code is basically saying "take this value as an integer, convert it to float, and return that." But the number is too big to fit in any integer type.
Solution: add ".0" or ".0f" to the end of the number to make it a double or float literal instead of an int.
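A short sketch of the fix. The variable names are made up, and the Earth mass figure (roughly 5.972e24 kg) is only there as a realistic planet-sized value. Note that a double keeps about 15 to 17 significant decimal digits, so a 24-digit mass is stored approximately, not digit for digit.

```cpp
#include <type_traits>

// A trailing .0 or an exponent turns the token into a floating literal.
static_assert(std::is_same<decltype(1000000000000000000000000.0), double>::value,
              "a decimal point makes it a double literal");
static_assert(std::is_same<decltype(1e24), double>::value,
              "exponent notation also makes it a double literal");
static_assert(std::is_same<decltype(1e24f), float>::value,
              "the f suffix makes it a float literal");

// The same variable type holds planet-sized and small values alike.
constexpr double earth_mass_kg = 5.972e24;
constexpr double small_value = 100.0;
```

None of these literals triggers the "integer constant is too large" warning, because none of them is an integer literal.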

Some questions about floating points

I'm wondering if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double and then still the same when cast to a long double.
I'm wondering because I'm writing a BigInteger implementation and any floating point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating points do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float. Is it reasonable to try and represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in. Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in base some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel there's a better way, and certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it's the same as a double).
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
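A possible shape for that range check, as a sketch only. A function template stands in here for the separate float and double overloads, and the name in_exact_integer_range is made up. It is slightly conservative: it accepts |x| < 2^digits and rejects 2^digits itself.

```cpp
#include <cmath>
#include <limits>

// Hypothetical check: true when |x| < 2^digits for the argument's own type,
// i.e. when every integer of that magnitude is exactly representable in T.
template <typename T>
bool in_exact_integer_range(T x) {
    if (!std::isfinite(x)) return false;  // frexp's exponent is unspecified for inf/nan
    int exp2 = 0;
    std::frexp(x, &exp2);  // x == m * 2^exp2 with 0.5 <= |m| < 1 (exp2 == 0 for zero)
    return exp2 <= std::numeric_limits<T>::digits;
}
```

Because T is deduced from the argument, 1e10f is rejected (float has 24 significand bits) while 1e10 as a double is accepted (53 bits). A BigInteger constructor could assert on this before converting.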
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
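For instance, base-10 digits could be pulled out like this. It is a sketch under the assumption that the value is non-negative and inside the exactly representable range discussed above; the helper name integer_digits is made up, and a real BigInteger would use a larger base and store limbs instead of characters.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical helper: decimal digits of the integer part of x.
// Assumes 0 <= x < 2^53 so that every intermediate value is exact.
std::string integer_digits(double x) {
    assert(x >= 0.0);
    x = std::floor(x);  // drop the fractional part (truncation, since x >= 0)
    if (x == 0.0) return "0";
    std::string digits;
    while (x > 0.0) {
        double d = std::fmod(x, 10.0);  // lowest digit; fmod is exact
        digits.insert(digits.begin(), static_cast<char>('0' + static_cast<int>(d)));
        x = (x - d) / 10.0;  // exact: x - d is a multiple of 10
    }
    return digits;
}
```

No string formatting is involved, so the result does not depend on printf precision or locale.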
Yes, going from IEEE float to double to extended, you will see the bits of the smaller format carried over into the larger format. For example:
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
The mantissa is left-justified and then padded with zeros.
The exponent is right-justified: copy the msbit, then sign-extend the two's complement value underneath it, which means filling the new bits with the inverse of the msbit.
Take an exponent of -2, for example. Subtract 1 to get -3. In two's complement, -3 is 0xFD or 0b11111101, but the exponent bits in the format are 0b01111101: the msbit is inverted. For double, an exponent of -2 gives -2 - 1 = -3 again, or 0b1111...1101, and that becomes 0b0111...1101 with the msbit inverted. (exponent bits = twos_complement(exponent - 1) with the msbit inverted.)
As we see above, for an exponent of 3: 3 - 1 = 2, or 0b000...010; invert the upper bit to get 0b100...010.
So yes, you can take the bits from single precision and copy them to the proper locations in the double precision number. I don't have an extended float reference handy, but I'm pretty sure it works the same way.
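Here is a sketch of that bit copying for float to double. It assumes IEEE-754 binary32 and binary64 and only covers normal numbers (zeros, subnormals, infinities and NaNs need their own cases). Instead of the invert-and-sign-extend trick, it rebiases the exponent with plain arithmetic (subtract 127, add 1023), which gives the same bits. All function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit patterns of a float and a double.
std::uint32_t bits_of(float x)  { std::uint32_t b; std::memcpy(&b, &x, sizeof b); return b; }
std::uint64_t bits_of(double x) { std::uint64_t b; std::memcpy(&b, &x, sizeof b); return b; }

// Hypothetical helper: widens the bit pattern of a normal binary32 value
// to the binary64 pattern of the same number.
std::uint64_t widen_bits(std::uint32_t f) {
    std::uint64_t sign     = f >> 31;
    std::uint64_t exponent = (f >> 23) & 0xFF;  // 8 exponent bits, bias 127
    std::uint64_t fraction = f & 0x7FFFFF;      // 23 stored mantissa bits
    assert(exponent != 0 && exponent != 0xFF);  // normal numbers only
    exponent = exponent - 127 + 1023;           // rebias for the 11-bit field
    // Mantissa is left-justified in the 52-bit field; the low 29 bits are zero.
    return (sign << 63) | (exponent << 52) | (fraction << 29);
}
```

For 6.5f the input pattern is 0x40D00000 and the widened pattern is 0x401A000000000000, matching the 6.5 single and 6.5 double rows above.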