The cppreference documentation states that std::round will specifically round away from zero in "halfway cases." While this is true with the literal 0.5, it's not true with std::sin(pi/6). I thought this might be a floating-point error, so I printed the value, but it prints as exactly 0.5. After inspecting the binary representation, however, I can see that the two values are indeed represented differently. I've provided the code I used to make these inspections below.
#include <iostream>
#include <stdio.h>
#include <cmath>

int main(int argc, char * argv[])
{
    double const pi = std::acos(-1);
    double const a = std::sin(pi/6);
    double const b = 0.5;

    std::cout << "round(" << a << ") = " << std::round(a) << "\n";
    std::cout << "round(" << b << ") = " << std::round(b) << "\n";

    auto pa = reinterpret_cast<const unsigned char *>(&a);
    auto pb = reinterpret_cast<const unsigned char *>(&b);

    std::cout << "a = 0x";
    for (size_t i = 0; i != sizeof(double); ++i) {
        printf("%02x", pa[i]);
    }
    std::cout << "\nb = 0x";
    for (size_t i = 0; i != sizeof(double); ++i) {
        printf("%02x", pb[i]);
    }
    std::cout << "\n";
}
round(0.5) = 0
round(0.5) = 1
a = 0xffffffffffffdf3f
b = 0x000000000000e03f
So my question is: is this rounding behavior part of the C++ specification, or is it a bug? And in any case, is there some general way I can "correct" the representation of the value returned by sin? I'm not sure what format it's in, because based on what I know of IEEE-754 it looks like it should be NaN. Although from what I understand, C++ doesn't guarantee IEEE-754 floating point representation?
The issue is that you're not printing the value with enough significant digits. When I increase precision with std::setprecision(20), I get: round(0.49999999999999994449) = 0.
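A minimal sketch of that change:

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    double const pi = std::acos(-1);
    double const a = std::sin(pi/6);
    // 20 significant digits are enough to expose the difference from 0.5
    std::cout << std::setprecision(20)
              << "round(" << a << ") = " << std::round(a) << "\n";
}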
You can see this for yourself by either changing the code or entering 3fdfffffffffffff into the bottom Hexadecimal field of this online calculator: https://baseconvert.com/ieee-754-floating-point
The representation looks like NaN because you're reading it backwards. x86/x64 stores floating point numbers little-endian, so you should read the bytes from high address to low address, yielding 0x3fdfffff..., which is of course slightly less than 0.5.
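For example, reversing the loop from your program (reusing the pa pointer) prints the bytes in conventional most-significant-first order:

// print the bytes from high address to low address
for (size_t i = sizeof(double); i-- != 0; ) {
    printf("%02x", pa[i]);
}

This prints 3fdfffffffffffff for a and 3fe0000000000000 for b.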
Related
Consider this code:
#include <iostream>

int main(){
    double k = ~0.0;
    std::cout << k << "\n";
}
It doesn't compile. I want to get a double value with all the bits set, which would be a NaN. Why doesn't this code work, and how do I flip all the bits of a double?
Regarding the code in the original question:
The 0 here is the int literal 0. ~0 is an int with value -1, so you are initializing k with the int -1. The conversion from int to double doesn't change the numerical value (but does change the bit pattern), and then you print the resulting double (which still represents -1).
Now, for the current question: you can't apply bitwise NOT to a double. It's simply not an allowed operation, precisely because it tends not to do anything useful for floating point values. It exists only for built-in integral types (plus anything that overloads operator~).
If you would like to flip all the bits in an object, the standard-conformant way is to do something like this:

#include <cstddef>  // std::byte
#include <memory>   // std::addressof

void flip_bits(auto &x) {
    // iterate through the bytes of x and flip all of them
    std::byte *p = reinterpret_cast<std::byte*>(std::addressof(x));
    for (std::size_t i = 0; i < sizeof(x); i++) p[i] = ~p[i];
}
Then

#include <iostream>

int main() {
    double x = 0;
    flip_bits(x);
    std::cout << x << "\n";
}
may (and usually will) print some variation of nan (depending on how your implementation actually represents double, of course).
Example on Godbolt
// the numeric constant ~0 is an integer
int foo = ~0;
std::cout << foo << '\n'; //< prints -1

// now it converts the int value of -1 to a double
double k = foo;
If you want to invert all of the bits, you'll need to reinterpret the double's bits as an integer, for example with a union over a uint64_t. (Strictly speaking, reading a union member other than the one last written is undefined behaviour in C++, though major compilers support this kind of type punning.)
#include <iostream>
#include <cstdint>

int main(){
    union {
        double k;
        uint64_t u;
    } double_to_uint64;

    double_to_uint64.u = ~0ULL;
    std::cout << double_to_uint64.k;
}
This will typically print -nan.
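If C++20 is available, std::bit_cast avoids the union type-punning question entirely; a minimal sketch:

#include <bit>
#include <cstdint>
#include <iostream>

int main() {
    // reinterpret the all-ones bit pattern as a double
    double k = std::bit_cast<double>(~std::uint64_t{0});
    std::cout << k << "\n";  // typically prints -nan
}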
The following demo program demonstrates some behaviour I don't understand.
#include <cstdint>
#include <limits>
#include <iostream>

constexpr double bits64 = 18446744073709551616.0; // 2^64

void diff_hash(double diff)
{
    double hash = bits64 / diff;
    uint64_t hash_64_1 = hash;
    uint64_t hash_64_2 = hash < std::numeric_limits<uint64_t>::max() ? hash : std::numeric_limits<uint64_t>::max();
    uint64_t hash_64_3 = std::numeric_limits<uint64_t>::max();
    if (hash < hash_64_3) {
        hash_64_3 = hash;
    }
    std::cout << "hash_64_1: " << hash_64_1 << ", " << "hash_64_2: " << hash_64_2 << ", " << "hash_64_3: " << hash_64_3 << std::endl;
}

int main()
{
    diff_hash(1);
    return 0;
}
output
hash_64_1: 0, hash_64_2: 0, hash_64_3: 18446744073709551615
Questions:
1.) Why is hash_64_1 == 0, even though the value being assigned is clearly the max 64-bit value?
2.) Why is hash_64_2 == 0? I confirmed that if I change the line to
uint64_t hash_64_2 = hash < std::numeric_limits<uint64_t>::max() ? hash : std::numeric_limits<uint32_t>::max();
the value of hash_64_2 becomes the max 32-bit value.
Link to Wandbox example https://wandbox.org/permlink/HyXRX2CiNgIIpYkQ
18446744073709551616.0 / 1.0 is evaluated as a double. Its value is 18446744073709551616.0, assuming IEEE 754. The behaviour on converting this out-of-range value to uint64_t is undefined. A common manifestation of that undefined behaviour is wrap-around to 0. (That's what most folk assume happens, but the behaviour really is undefined when converting from an out-of-range floating point value.)
With the expression hash < std::numeric_limits<uint64_t>::max(), the right hand side is implicitly converted to a double. That number cannot be represented exactly as a double, so it is rounded to the nearest double, which is 18446744073709551616.0, and the comparison is false. The conditional operator then converts its selected operand to double as well (double is the common type of its two operands), so hash_64_2 is initialized from the same out-of-range double, and is 0 too.
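If the intent is a saturating conversion, the comparison has to be done entirely in double, against a bound that is exactly representable. A minimal sketch (2^64 is exactly representable as a double, and every non-negative double below it converts safely):

#include <cstdint>
#include <limits>

uint64_t to_u64_saturating(double hash)
{
    constexpr double two64 = 18446744073709551616.0;  // 2^64, exact as a double
    if (!(hash >= 0.0))  // negative values and NaN
        return 0;
    if (hash >= two64)   // too large for uint64_t
        return std::numeric_limits<uint64_t>::max();
    return static_cast<uint64_t>(hash);  // in range, well defined
}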
1.) Why is hash_64_1 == 0, even though the value being assigned is clearly the max 64-bit value?
That is actually hardly clear. hash is clearly greater than the max 64-bit value. The behaviour of converting a floating point value that is unrepresentable in the target type to an integer is undefined.
"is assigned is clearly max 64 value" --> Off-by-one.
The max uint64_t value is 18446744073709551615, not 18446744073709551616.
Effects seen are due to UB of converting an out of range double to uint64_t.
I want to convert an MPFR floating point number into a string.
When I run my program, the string is generated, but without the "." in the number. How can I do this right?
#include <stdlib.h>
#include <string.h>
#include <iostream>
#include <mpreal.h>

using mpfr::mpreal;
using std::cout;
using std::endl;

int main (int ac, char *av[])
{
    char data[255];
    mpreal x = 42.0, y = 3.14159265358979323846, res = 0.0;
    mp_exp_t exponent = 10;
    // string data_str[256];
    int precision = 50;

    res = x * y;

    cout.precision(100);
    cout << res;
    cout << "\n";

    /*
    if (mpfr_snprintf (data, 254, "%.20Ff", res.mpfr_srcptr()) < 0)
    {
        cout << "gmp_prints_float: error saving string!\n";
    }
    */

    mpfr_get_str ((char *) &data, &exponent, 10, precision, res.mpfr_srcptr(), GMP_RNDN);
    cout << data;
    cout << "\n";

    mpfr_free_cache ();
}
131.946891450771317977341823279857635498046875
13194689145077131797734182327985763549804687500000
There is no decimal point in the string output!
From the documentation:
The generated string is a fraction, with an implicit radix point immediately to the left of the first digit. For example, the number -3.1416 would be returned as "-31416" in the string and 1 written at expptr.
It is up to you to generate a human-readable representation from the string and the exponent.
An alternative would be to use mpfr_sprintf.
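For instance, a sketch based on the commented-out code in the question; note that %Rf (not %Ff, which is for GMP's mpf_t) is the conversion specifier for an mpfr_t argument:

char buf[255];
// fixed-point output with 20 fractional digits, decimal point included
if (mpfr_snprintf (buf, sizeof(buf), "%.20Rf", res.mpfr_srcptr()) < 0)
{
    cout << "error formatting the value!\n";
}
else
{
    cout << buf << "\n";  // 131.94689145077131797734
}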
MPFR's mpfr_get_str function is modeled on GMP's mpf_get_str function, which explains why it was chosen not to write a decimal point. There are two solutions for getting a decimal point:
Use mpfr_sprintf (or some variant), as suggested in this answer. I would recommend this solution (unless perhaps you want to ignore the locales), as it is the most flexible in the output format and does not need a correction.
If you just want the significand with an explicit decimal point, use mpfr_get_str, but pass a pointer buffer+1 instead of buffer. Then do something like (disregarding the locales)
int neg = buffer[1] == '-';
if (neg)
    buffer[0] = '-';
buffer[neg] = '.';
after filtering the special cases (NaN and infinities).
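Putting it together with the res variable from the question, a minimal sketch (special cases skipped):

char buffer[64];
mp_exp_t exponent;

// write the digits starting at buffer + 1, keeping buffer[0] free for
// the decimal point (or for a moved minus sign)
mpfr_get_str (buffer + 1, &exponent, 10, 50, res.mpfr_srcptr(), GMP_RNDN);

int neg = buffer[1] == '-';
if (neg)
    buffer[0] = '-';
buffer[neg] = '.';

cout << buffer << " * 10^" << exponent << "\n";  // e.g. .13194... * 10^3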
I'm trying to improve the performance of surf.cpp. At line 140 you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}
Running an Intel Advisor vectorization analysis on it reports "1 Data type conversions present", which could be inefficient (especially for vectorization).
But my question is: looking at this function, why would the authors have declared d as a double and then cast it to float? If they wanted a decimal number, float would be fine. The only reason that comes to mind is that since double is more precise than float, it can represent smaller numbers, while the final value is big enough to be stored in a float; but I didn't run any test on d's value.
Any other possible reason?
Because the author wants higher precision during the calculation, and only rounds the final result. This is the same as preserving more significant digits during the calculation.
More precisely, error can accumulate during addition and subtraction, and this error can become considerable when a large number of floating point values is involved.
You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>

float w = 0.012345;

float calcFloat(const int* origin, int n )
{
    float d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}

float calcDouble(const int* origin, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}

int main()
{
    int o[] = { 1111, 22222, 33333, 444444, 5555 };
    std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
    std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << std::setprecision(9) << (f1 + f2) << '\n';
std::cout << std::setprecision(9) << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;

float f = f1;
f += f2;

double d = f1;
d += f2;

std::cout << std::setprecision(9) << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.
When dealing with floating point values in Java, calling the toString() method gives a printed value with the correct number of significant figures. However, in C++, printing a float via stringstream rounds the value to six or fewer significant digits by default. Is there a way to "pretty print" a float in C++ to the (assumed) correct number of significant figures?
EDIT: I think I am being misunderstood. I want the output to be of dynamic length, not a fixed precision. I am familiar with setprecision. If you look at the java source for Double, it calculates the number of significant digits somehow, and I would really like to understand how it works and/or how feasible it is to replicate this easily in C++.
/*
 * FIRST IMPORTANT CONSTRUCTOR: DOUBLE
 */
public FloatingDecimal( double d )
{
    long dBits = Double.doubleToLongBits( d );
    long fractBits;
    int  binExp;
    int  nSignificantBits;

    // discover and delete sign
    if ( (dBits&signMask) != 0 ){
        isNegative = true;
        dBits ^= signMask;
    } else {
        isNegative = false;
    }

    // Begin to unpack
    // Discover obvious special cases of NaN and Infinity.
    binExp = (int)( (dBits&expMask) >> expShift );
    fractBits = dBits&fractMask;
    if ( binExp == (int)(expMask>>expShift) ) {
        isExceptional = true;
        if ( fractBits == 0L ){
            digits = infinity;
        } else {
            digits = notANumber;
            isNegative = false; // NaN has no sign!
        }
        nDigits = digits.length;
        return;
    }
    isExceptional = false;

    // Finish unpacking
    // Normalize denormalized numbers.
    // Insert assumed high-order bit for normalized numbers.
    // Subtract exponent bias.
    if ( binExp == 0 ){
        if ( fractBits == 0L ){
            // not a denorm, just a 0!
            decExponent = 0;
            digits = zero;
            nDigits = 1;
            return;
        }
        while ( (fractBits&fractHOB) == 0L ){
            fractBits <<= 1;
            binExp -= 1;
        }
        nSignificantBits = expShift + binExp + 1; // recall binExp is - shift count.
        binExp += 1;
    } else {
        fractBits |= fractHOB;
        nSignificantBits = expShift + 1;
    }
    binExp -= expBias;
    // call the routine that actually does all the hard work.
    dtoa( binExp, fractBits, nSignificantBits );
}
After this function, it calls dtoa( binExp, fractBits, nSignificantBits ); which handles a bunch of cases - this is from OpenJDK6
For more clarity, an example:
Java:
double test1 = 1.2593;
double test2 = 0.004963;
double test3 = 1.55558742563;
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
Output:
1.2593
0.004963
1.55558742563
C++:
std::cout << test1 << "\n";
std::cout << test2 << "\n";
std::cout << test3 << "\n";
Output:
1.2593
0.004963
1.55559
I think you are talking about how to print the minimum number of floating point digits that allow you to read the exact same floating point number back. This paper is a good introduction to this tricky problem.
http://grouper.ieee.org/groups/754/email/pdfq3pavhBfih.pdf
The dtoa function looks like David Gay's work; you can find the source at http://www.netlib.org/fp/dtoa.c (although this is C, not Java).
Gay also wrote a paper about his method. I don't have a link, but it's referenced in the paper above, so you can probably google it.
Is there a way to "pretty print" a float in C++ to the (assumed) correct number of significant figures?
Yes, you can do it with C++20 std::format, for example:
#include <format>
#include <iostream>

int main() {
    double test1 = 1.2593;
    double test2 = 0.004963;
    double test3 = 1.55558742563;
    std::cout << std::format("{}", test1) << "\n";
    std::cout << std::format("{}", test2) << "\n";
    std::cout << std::format("{}", test3) << "\n";
}
prints
1.2593
0.004963
1.55558742563
The default format will give you the shortest decimal representation with a round-trip guarantee like in Java.
Since this is a new feature and may not be supported by some standard libraries yet, you can use the {fmt} library that std::format is based on. {fmt} also provides the print function, which makes this even easier and more efficient (godbolt):
fmt::print("{}", 1.2593);
Disclaimer: I'm the author of {fmt} and C++20 std::format.
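If std::format and {fmt} are both unavailable, C++17 std::to_chars from <charconv> also produces the shortest round-trip representation, though without the formatting conveniences; a minimal sketch:

#include <charconv>
#include <cstdio>

int main() {
    double test3 = 1.55558742563;
    char buf[32];
    // with no explicit format argument, to_chars emits the shortest string
    // that parses back to exactly the same double
    auto [ptr, ec] = std::to_chars(buf, buf + sizeof(buf), test3);
    *ptr = '\0';
    std::puts(buf);  // prints 1.55558742563
}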
You can use the ios_base::precision technique to specify the number of digits you want. For example:
#include <iostream>
using namespace std;

int main () {
    double f = 3.14159;

    cout.unsetf(ios::floatfield);           // floatfield not set
    cout.precision(5);
    cout << f << endl;

    cout.precision(10);
    cout << f << endl;

    cout.setf(ios::fixed, ios::floatfield); // floatfield set to fixed
    cout << f << endl;

    return 0;
}
The above code will output
3.1416
3.14159
3.1415900000
There is a utility called numeric_limits:
#include <limits>
...
int num10 = std::numeric_limits<double>::digits10;
int max_num10 = std::numeric_limits<double>::max_digits10;
Note that IEEE numbers are not represented exactly by decimal digits; they are binary quantities. A more accurate figure is the number of binary bits:
int bits = std::numeric_limits<double>::digits;
To pretty print all the significant digits, use setprecision with max_digits10 (digits10 is the number of decimal digits guaranteed to survive a round trip through double; max_digits10 is the number needed so that the printed text reads back as the same double):
out.setprecision(std::numeric_limits<double>::max_digits10);
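A minimal sketch putting this together (the printed values assume IEEE-754 double):

#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<double>::digits10 << "\n";      // 15
    std::cout << std::numeric_limits<double>::max_digits10 << "\n";  // 17
    // 17 significant digits are always enough to read back the same double
    std::cout.precision(std::numeric_limits<double>::max_digits10);
    std::cout << 0.1 << "\n";  // 0.10000000000000001
}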