Hexadecimal floating point literals - c++

Is it possible to initialize a float variable with a hexadecimal float point value in C++?
Something like this:
double d = 0x011.1; // wrong!

The proposal P0245, Hexadecimal floating literals for C++, was voted into C++17 at the ISO C++ Standards Committee meeting in Jacksonville, Florida, in February 2016.
The language C99 also has this feature, and the C++ feature is compatible.
However, as pointed out in Lưu Vĩnh Phúc's comment, the syntax 0x011.1 is not part of the standard. The binary exponent is mandatory for hexadecimal floating-point literals. One reason is to avoid the ambiguity of a trailing F, as in 0x011.1F: is it the hex digit F of the fractional part, or the floating-suffix meaning float?
Therefore, append p followed by an optionally signed decimal exponent, for example: 0x011.1p0.
See the more readable floating literal page on cppreference.com.
0x | 0X hex-digit-sequence exponent suffix(optional)
0x | 0X hex-digit-sequence . exponent suffix(optional)
0x | 0X hex-digit-sequence(optional) . hex-digit-sequence exponent suffix(optional)
In the first form, the hexadecimal digit-sequence represents a whole number without a radix separator. The exponent is never optional for hexadecimal floating-point literals: 0x1ffp10, 0X0p-1, 0x1.p0, 0xf.p-1, 0x0.123p-1, 0xa.bp10l
The exponent syntax for a hexadecimal floating-point literal has the form
p | P exponent-sign(optional) digit-sequence
exponent-sign, if present, is either + or -
suffix, if present, is one of f, F, l, or L. The suffix determines the type of the floating-point literal:
(no suffix) defines double
f F defines float
l L defines long double
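Putting the pieces together, here is a minimal C++17 sketch (the variable names are just illustrative) showing the mandatory binary exponent and the three suffix forms:

#include <iostream>

int main()
{
    double d = 0x011.1p0;        // no suffix -> double; 0x11 = 17, 0x0.1 = 1/16, so the value is 17.0625
    float f = 0x011.1p0f;        // f suffix -> float
    long double l = 0x011.1p0L;  // L suffix -> long double
    double scaled = 0xC.68p+2;   // p+2 scales the significand by 2^2, giving 49.625

    std::cout << d << ' ' << f << ' ' << l << ' ' << scaled << '\n';  // 17.0625 17.0625 17.0625 49.625
}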
See also the current working draft C++17, chapter § 2.13.4 Floating literals on GitHub: https://github.com/cplusplus/draft/raw/master/papers/n4604.pdf
floating-literal:
  decimal-floating-literal
  hexadecimal-floating-literal
decimal-floating-literal:
  fractional-constant exponent-part(opt) floating-suffix(opt)
  digit-sequence exponent-part floating-suffix(opt)
hexadecimal-floating-literal:
  hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part floating-suffix(opt)
  hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part floating-suffix(opt)
fractional-constant:
  digit-sequence(opt) . digit-sequence
  digit-sequence .
hexadecimal-fractional-constant:
  hexadecimal-digit-sequence(opt) . hexadecimal-digit-sequence
  hexadecimal-digit-sequence .
exponent-part:
  e sign(opt) digit-sequence
  E sign(opt) digit-sequence
binary-exponent-part:
  p sign(opt) digit-sequence
  P sign(opt) digit-sequence
sign: one of
  + -
digit-sequence:
  digit
  digit-sequence '(opt) digit
floating-suffix: one of
  f l F L
1 A floating literal consists of an optional prefix specifying a base, an integer part, a radix point, a fraction part, an e, E, p or P, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits if there is no prefix, or hexadecimal (base sixteen) digits if the prefix is 0x or 0X. The literal is a decimal floating literal in the former case and a hexadecimal floating literal in the latter case. Optional separating single quotes in a digit-sequence or hexadecimal-digit-sequence are ignored when determining its value. [ Example: The literals 1.602'176'565e-19 and 1.602176565e-19 have the same value. — end example ] Either the integer part or the fraction part (not both) can be omitted. Either the radix point or the letter e or E and the exponent (not both) can be omitted from a decimal floating literal. The radix point (but not the exponent) can be omitted from a hexadecimal floating literal. The integer part, the optional radix point, and the optional fraction part, form the significand of the floating literal. In a decimal floating literal, the exponent, if present, indicates the power of 10 by which the significand is to be scaled. In a hexadecimal floating literal, the exponent indicates the power of 2 by which the significand is to be scaled. [ Example: The literals 49.625 and 0xC.68p+2 have the same value. — end example ] If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.
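The power-of-two scaling in the standard's example can even be checked at compile time; this tiny snippet (requiring C++17 for the hexadecimal literal) is just an illustration:

// 0xC.68 is 12 + 6/16 + 8/256 = 12.40625, and p+2 scales it by 2^2 = 4, giving 49.625.
static_assert(0xC.68p+2 == 49.625, "hexadecimal and decimal spellings denote the same value");

int main() {}

Both values are exactly representable in binary floating point, so the equality is exact, not approximate.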
As unwind has advised, you can use strtof(). The following snippet decodes hexadecimal floating literals without requiring C++17:
#include <iostream>
#include <cstdlib>   // std::strtof, std::strtod, std::strtold
#include <cerrno>    // errno, ERANGE

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " 0xA.Bp-1 => Decode hexfloat" "\n";
        return 1;
    }

    long double l;
    double d;
    float f;

    std::cout << "Decode floating point hexadecimal = " << argv[1];
    //std::istringstream(argv[1]) >> std::hexfloat >> d;   // alternative, needs <sstream>

    // Reset errno before each conversion so that a stale value is not reported.
    errno = 0;
    l = std::strtold(argv[1], nullptr);
    if (errno == ERANGE) std::cout << "\n" "std::strtold() range error";

    errno = 0;
    d = std::strtod(argv[1], nullptr);
    if (errno == ERANGE) std::cout << "\n" "std::strtod() range error";

    errno = 0;
    f = std::strtof(argv[1], nullptr);
    if (errno == ERANGE) std::cout << "\n" "std::strtof() range error";

    std::cout << "\n" "long double = " << std::defaultfloat << l << '\t' << std::hexfloat << l
              << "\n" "double      = " << std::defaultfloat << d << '\t' << std::hexfloat << d
              << "\n" "float       = " << std::defaultfloat << f << '\t' << std::hexfloat << f << '\n';
}
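For example, invoking the program with the argument 0xA.Bp-1 from the usage message should report 5.34375 for all three types (0xA.B is 10 + 11/16 = 10.6875, and p-1 halves it); the exact hexfloat spelling echoed back by std::hexfloat is implementation-dependent.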

No, C++ doesn't support that for literals; it's not part of the standard.
A non-portable solution is to use a compiler that adds this as an extension (GCC does this).
A portable workaround is to parse them from string literals at runtime using e.g. strtof() or strtod() for double.
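A minimal sketch of that workaround (the string is just an example value, and it assumes a C99/C++11-conforming strtod that accepts hexadecimal input):

#include <cstdlib>    // std::strtod
#include <iostream>

int main()
{
    // Parse a hexadecimal floating-point value from text at runtime.
    const char *text = "0x1.8p3";              // 1.5 * 2^3 = 12
    double d = std::strtod(text, nullptr);
    std::cout << d << '\n';                    // prints 12
}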
As pointed out in a comment, you can also opt to store the constants in a C file. Doing so requires that you have access to a C99 compiler though, since hex float literals are a C99-level feature. Since environments with a new C++ compiler but without a C99 compiler (read: Visual Studio) are quite common, that might not be a workable solution.
Update: C++17 supports hexadecimal floating point literals.

Related

Unexpected results using sprintf and %g to convert double to string

When I use MBCS and msvcr120.dll (12.0.40660.0) I get unexpected results when using %g with sprintf to convert a double to a string. The documentation for %g says default precision will be 6. Why am I seeing the results below?
{
    double d = 1234567.00;
    char buf[100];
    sprintf_s(buf, sizeof(buf), "%g", d);
    // result is 1.23457e+006
}
Why is the result 1.23457e+006 instead of 1.23456e+006? Does truncation occur after 6 digits?
Why am I seeing the results below?
This is how the C standard specifies the format in section [Formatted input/output functions] (C++ delegates the specification):
f,F
A double argument representing a floating-point number is converted to decimal notation in style [−]ddd.ddd, where the number of digits after the decimal-point character is equal to the precision specification. If the precision is missing, it is taken as 6; if the precision is zero and the # flag is not specified, no decimal-point character appears. If a decimal-point character appears, at least one digit appears before it. The value is rounded to the appropriate number of digits.
e,E
A double argument representing a floating-point number is converted in the style [-]d.ddde±dd, where there is one digit (which is nonzero if the argument is nonzero) before the decimal-point character and the number of digits after it is equal to the precision; if the precision is missing, it is taken as 6; if the precision is zero and the # flag is not specified, no decimal-point character appears. The value is rounded to the appropriate number of digits. The E conversion specifier produces a number with E instead of e introducing the exponent. The exponent always contains at least two digits, and only as many more digits as necessary to represent the exponent. If the value is zero, the exponent is zero.
A double argument representing an infinity is converted in one of the styles [-]inf or [-]infinity; which style is implementation-defined. A double argument representing a NaN is converted in one of the styles [-]nan or [-]nan(n-char-sequence); which style, and the meaning of any n-char-sequence, is implementation-defined. The F conversion specifier produces INF, INFINITY, or NAN instead of inf, infinity, or nan, respectively.
g,G
A double argument representing a floating-point number is converted in style f or e (or in style F or E in the case of G conversion specifier), depending on the value converted and the precision. Let P equal the precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero. Then, if a conversion with style E would have an exponent of X:
if P > X ≥ -4, the conversion is with style f (or F) and precision P - (X + 1).
otherwise, the conversion is with style e (or E) and precision P - 1.
Why is the result 1.23457e+006 instead of 1.23456e+006?
Because the default precision is 6, and the value is rounded.
The default rounding mode (according to IEEE 754) is "round to nearest, ties to even". The two nearest 6-significant-digit values to 1.234567 are 1.23456 and 1.23457; 1.23457 is nearer, so 1.234567 rounds to 1.23457.
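You can see the effect of the precision directly by asking for one more significant digit; this sketch (using printf for brevity) is only for illustration:

#include <cstdio>

int main()
{
    double d = 1234567.00;
    std::printf("%g\n", d);    // default precision 6: prints 1.23457e+06 (older MSVC runtimes print 1.23457e+006)
    std::printf("%.7g\n", d);  // precision 7: style f is chosen and it prints 1234567
}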

sprintf %g specifier gives too few digits after point

I'm trying to write floating-point vars into my ini file and I encountered a problem with format specifiers.
I have a float value, let it be 101.9716. Now I want to write it to my ini file, but the problem is that I have other float values which have less precision (such as 15.85), and those values are written to the ini file in the same loop.
So I do:
sprintf(valLineY, "%g", grade[i].yArr[j]);
All my other variables become nice chars like "20" (if it was 20.00000), "13.85" (if it was 13.850000) and so on. But 101.9716 becomes "101.972" for some reason.
Can you please tell me why this happens and how to make it "101.9716" without ruining my approach (which is about removing trailing zeroes and unneeded precision)?
Thanks for any help.
Why does this happen?
I tested:
double f = 101.9716;
printf("%f\n", f);
printf("%e\n", f);
printf("%g\n", f);
And it output:
101.971600
1.019716e+02 // Notice the exponent +02
101.972
Here's what the C standard (N1570 7.21.6.1) says about the conversion specifier g:
A double argument representing a floating-point number is converted in
style f or e (or in style F or E in the case of a G conversion specifier), depending on the value converted and the precision. Let P equal the precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero. Then, if a conversion with style E would have an exponent of X:
— if P > X ≥ −4, the conversion is with style f (or F) and precision
P − (X + 1).
— otherwise, the conversion is with style e (or E) and precision P − 1.
So, given the above, P will equal 6, because the precision is not specified, and X will equal 2, because that is the exponent in style e.
The condition 6 > 2 >= -4 is thus true, so style f is selected, and the precision then becomes 6 - (2 + 1) = 3.
How to fix?
Unlike f, style g will strip unnecessary zeroes, even when precision is set.
Finally, unless the # flag is used, any trailing zeros are removed from the fractional portion of the result and the decimal-point character is removed if there is no fractional portion remaining.
So set high enough precision:
printf("%.8g\n", f);
prints:
101.9716
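A short sketch with values taken from the question, confirming that a larger precision with %g still strips the trailing zeroes:

#include <cstdio>

int main()
{
    std::printf("%.8g\n", 20.0);      // prints 20
    std::printf("%.8g\n", 13.85);     // prints 13.85
    std::printf("%.8g\n", 101.9716);  // prints 101.9716
}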

confusing conditional statement in a for loop

I saw some code and I was astonished by the conditional statement in it.
I tried running it, but the loop seems to become an infinite loop.
The full code is :
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    int *n;
    for (int i = 0; i < 5e7; i++)
        n = new int;
    delete n;
}
How does that even compile, and why does the loop become an infinite loop?
What is this type of conditional statement?
To represent floating constants, the C++ Standard introduces floating literals (2.14.4 Floating literals):
floating-literal:
  fractional-constant exponent-part(opt) floating-suffix(opt)
  digit-sequence exponent-part floating-suffix(opt)
fractional-constant:
  digit-sequence(opt) . digit-sequence
  digit-sequence .
exponent-part:
  e sign(opt) digit-sequence
  E sign(opt) digit-sequence
sign: one of
  + -
digit-sequence:
  digit
  digit-sequence '(opt) digit
floating-suffix: one of
  f l F L
Thus 5e7 is a floating literal. You can output it to the console (with <iomanip> included for std::setprecision), for example, the following way
std::cout << std::fixed << std::setprecision( 0 ) << 5e7 << std::endl;
and the output will be
50000000
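If you want to confirm the type of the literal itself rather than its value, a static_assert sketch like this (C++11 or later) also works:

#include <type_traits>

static_assert(std::is_same<decltype(5e7), double>::value, "5e7 has type double");

int main() {}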
According to the rules of the usual arithmetic conversions, in this condition
i < 5e7
variable i is converted to type double (because the floating literal has type double) and compared with the floating literal. As soon as i is greater than or equal to the floating literal, the loop stops iterating. That can only happen when the maximum value of type int is not less than the value of the floating literal.
You can check the maximum value of an object of type int the following way
#include <limits>
//...
std::cout << std::numeric_limits<int>::max() << std::endl;
When I ran this code I got the following result:
2147483647
Thus, for these values the loop cannot be infinite. However, maybe on your system the maximum value of int is less than the value of the floating literal; in that case the loop will indeed be infinite.

Why do compilers fix the digits of floating point number to 6?

According to The C++ Programming Language - 4th, section 6.2.5:
There are three floating-point types: float (single-precision), double (double-precision), and long double (extended-precision)
Refer to: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).
→ The maximum number of decimal digits of a floating-point number is 7 in the binary32 interchange format (a computer number format that occupies 4 bytes (32 bits) in computer memory).
When I test on different compilers (like GCC, VC compiler)
→ It always outputs 6 as the value.
Take a look into float.h of each compiler
→ I found that 6 is fixed.
Question:
Do you know why there is a difference here (between the theoretical value, 7, and the actual value, 6)?
It sounds like "7" is more reasonable, because when I test using the code below, the value is still valid, while "8" is invalid.
Why don't the compilers check the interchange format when deciding the number of digits represented in floating-point (instead of using a fixed value)?
Code:
#include <iostream>
#include <limits>
using namespace std;

int main()
{
    cout << numeric_limits<float>::digits10 << endl;
    float f = -9999999;
    cout.precision(10);
    cout << f << endl;
}
You're not reading the documentation.
std::numeric_limits<float>::digits10 is 6:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
std::numeric_limits<float>::max_digits10 is 9:
The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T, such as necessary for serialization/deserialization to text. This constant is meaningful for all floating-point types.
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
std::numeric_limits<float>::digits10 equates to FLT_DIG, which is defined by the C standard:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
  p × log10(b)           if b is a power of 10
  ⌊(p − 1) × log10(b)⌋   otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
The reason for the value 6 (and not 7), is due to rounding errors - not all floating point values with 7 decimal digits can be losslessly represented by a 32-bit float. Rounding errors are limited to 1 bit though, so the FLT_DIG value was calculated based on 23 bits (instead of the full 24) :
23 * log10(2) = 6.92
which is rounded down to 6.
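A small sketch (C++11 or later) printing both constants and reproducing the round-trip failure mentioned above:

#include <iostream>
#include <iomanip>
#include <limits>

int main()
{
    std::cout << std::numeric_limits<float>::digits10 << '\n';      // 6
    std::cout << std::numeric_limits<float>::max_digits10 << '\n';  // 9

    // 7 significant decimal digits do not always survive a trip through float:
    float f = 8.589973e9f;
    std::cout << std::setprecision(7) << f << '\n';                 // prints 8.589974e+09, not 8.589973e+09
}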

wstringstream default format flag

I am trying to convert a double holding a 3-byte value to a string. Following is my code:
double b = 0xFFFFFF;
std::wstring ss;
std::wstringstream sOut;
sOut << b;
ss = boost::lexical_cast<std::wstring>(sOut.str());
I expect the output to be 16777215, but "ss" has the value 1.67772e+007.
However, when I use the "fixed" flag, I get the expected output:
sOut << std::fixed
My question is whether wstringstream has the "scientific" flag set by default?
Thanks,
All streams (not just wstringstream) have their floating-point formatting set to ios_base::defaultfloat by default, which requests the formatting you're observing; it is equivalent to printf's conversion specifier %g.
To quote C's description of %g
A double argument representing a floating-point number is converted in style f or e (or in style F or E in the case of a G conversion specifier), depending on the value converted and the precision. Let P equal the precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero. Then, if a conversion with style E would have an exponent of X:
if P > X >= -4, the conversion is with style f (or F) and precision P - (X + 1).
otherwise, the conversion is with style e (or E) and precision P - 1.
In your case, style e is selected: 0xFFFFFF is 16777215, so a conversion in style e would have exponent X = 7, and P = 6 is not greater than 7.
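For completeness, a minimal sketch contrasting the default and fixed formatting on a wide-character stream (the stream names are just examples):

#include <iostream>
#include <sstream>

int main()
{
    double b = 0xFFFFFF;                   // 16777215
    std::wstringstream defaultOut, fixedOut;
    defaultOut << b;                       // defaultfloat, %g-like: 1.67772e+07 (1.67772e+007 on older MSVC runtimes)
    fixedOut << std::fixed << b;           // fixed: 16777215.000000
    std::wcout << defaultOut.str() << L'\n' << fixedOut.str() << L'\n';
}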