What is the difference between 0 and -0 floating point value? - c++

This code snippet in Visual Studio 2013:
double a = 0.0;
double b = -0.0;
cout << (a == b) << " " << a << " " << b;
prints 1 0 -0. What is the difference between a and b?

C++ itself does not guarantee a distinction between +0 and -0; that is a property of the particular number representation. The IEEE 754 standard for floating-point arithmetic does make this distinction, which can be used to preserve sign information even when a computation goes to zero. std::numeric_limits does not directly tell you whether signed zeroes are possible, but if std::numeric_limits<double>::is_iec559 is true then you can in practice assume IEEE 754 representation, and thus a possible negative zero.
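For example, a minimal sketch of that check (the printed text is my illustration, not from the original answer):
#include <iostream>
#include <limits>

int main()
{
    if (std::numeric_limits<double>::is_iec559)
        std::cout << "IEEE 754 doubles: a distinct negative zero exists.\n";
    else
        std::cout << "Non-IEEE representation: signed zero is not guaranteed.\n";
}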
As noted by “gmch” in a comment, the C++11 standard library way to check the sign of a zero is to use std::copysign, or more directly std::signbit, e.g. as follows:
#include <iostream>
#include <math.h>       // copysign, signbit
using namespace std;

auto main() -> int
{
    double const z1 = +0.0;
    double const z2 = -0.0;

    cout << boolalpha;
    cout << "z1 is " << (signbit( z1 )? "negative" : "positive") << "." << endl;
    cout << "z2 is " << (signbit( z2 )? "negative" : "positive") << "." << endl;
}
Without copysign or signbit, e.g. with a C++03 compiler, one way to detect a negative zero z is to check whether 1.0/z is negative infinity, which you can do simply by checking whether it is negative:
#include <iostream>
using namespace std;

auto main() -> int
{
    double const z1 = +0.0;
    double const z2 = -0.0;

    cout << boolalpha;
    cout << "z1 is " << (1/z1 < 0? "negative" : "positive") << "." << endl;
    cout << "z2 is " << (1/z2 < 0? "negative" : "positive") << "." << endl;
}
But while this will probably work in practice on just about any implementation, it is formally *undefined behavior. One also needs to be sure that the expression evaluation will not trap.
*) C++11 §5.6/4: “If the second operand of / or % is zero the behavior is undefined”

See http://en.m.wikipedia.org/wiki/Signed_zero
In a nutshell, it is due to the sign being stored as a stand-alone bit in the IEEE 754 floating-point representation. A value can therefore have a zero exponent and a zero fraction but still have its sign bit set, giving a negative zero. This is a condition that cannot happen for signed integers, which are stored in two's complement.
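To see that stand-alone sign bit directly, you can inspect the bit pattern; a minimal sketch, assuming IEEE 754 doubles and a C++20 compiler for std::bit_cast:
#include <bit>       // std::bit_cast (C++20)
#include <cstdint>
#include <iostream>

int main()
{
    // The only difference between +0.0 and -0.0 is the topmost (sign) bit.
    std::cout << std::hex
              << std::bit_cast<std::uint64_t>(+0.0) << '\n'   // prints 0
              << std::bit_cast<std::uint64_t>(-0.0) << '\n';  // prints 8000000000000000
}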

Related

Why does C++ automatically round floats [duplicate]

In my earlier question I was printing a double using cout that got rounded when I wasn't expecting it. How can I make cout print a double using full precision?
You can set the precision directly on std::cout and use the std::fixed format specifier.
double d = 3.14159265358979;
cout.precision(17);
cout << "Pi: " << fixed << d << endl;
You can #include <limits> to get the maximum precision of a float or double.
#include <limits>
typedef std::numeric_limits< double > dbl;
double d = 3.14159265358979;
cout.precision(dbl::max_digits10);
cout << "Pi: " << d << endl;
Use std::setprecision:
#include <iomanip>
std::cout << std::setprecision (15) << 3.14159265358979 << std::endl;
Here is what I would use:
std::cout << std::setprecision (std::numeric_limits<double>::digits10 + 1)
<< 3.14159265358979
<< std::endl;
Basically the <limits> header has traits for all the built-in types.
One of the traits for the floating-point types (float/double/long double) is the digits10 attribute. This is the number of decimal digits the type can represent without change, i.e. its decimal precision.
See http://www.cplusplus.com/reference/std/limits/numeric_limits.html for details about the other attributes.
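For a quick look at those traits, a small sketch (the values shown assume IEEE 754 doubles):
#include <iostream>
#include <limits>

int main()
{
    std::cout << "digits10:     " << std::numeric_limits<double>::digits10     << '\n';  // typically 15
    std::cout << "max_digits10: " << std::numeric_limits<double>::max_digits10 << '\n';  // typically 17
}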

How do I print a double value with full precision using cout?

Use hexfloat or
use scientific and set the precision
std::cout.precision(std::numeric_limits<double>::max_digits10 - 1);
std::cout << std::scientific << 1.0/7.0 << '\n';
// C++11 Typical output
1.4285714285714285e-01
Too many answers address only one of: 1) base, 2) fixed/scientific layout, or 3) precision. Too many answers with precision do not provide the proper value needed. Hence this answer to an old question.
What base?
A double is certainly encoded using base 2. A direct approach with C++11 is to print using std::hexfloat.
If a non-decimal output is acceptable, we are done.
std::cout << "hexfloat: " << std::hexfloat << exp (-100) << '\n';
std::cout << "hexfloat: " << std::hexfloat << exp (+100) << '\n';
// output
hexfloat: 0x1.a8c1f14e2af5dp-145
hexfloat: 0x1.3494a9b171bf5p+144
Otherwise: fixed or scientific?
A double is a floating point type, not fixed point.
Do not use std::fixed as that fails to print small double values as anything but 0.000...000. For large double values it prints many digits, perhaps hundreds, of questionable informativeness.
std::cout << "std::fixed: " << std::fixed << exp (-100) << '\n';
std::cout << "std::fixed: " << std::fixed << exp (+100) << '\n';
// output
std::fixed: 0.000000
std::fixed: 26881171418161356094253400435962903554686976.000000
To print with full precision, first use std::scientific, which will "write floating-point values in scientific notation". The default of 6 digits after the decimal point is insufficient; that is handled in the next point.
std::cout << "std::scientific: " << std::scientific << exp (-100) << '\n';
std::cout << "std::scientific: " << std::scientific << exp (+100) << '\n';
// output
std::scientific: 3.720076e-44
std::scientific: 2.688117e+43
How much precision (how many total digits)?
A double encoded in binary base 2 has the same precision between successive powers of 2. This is often 53 bits.
In [1.0...2.0) there are 2^53 different double values,
in [2.0...4.0) there are 2^53 different double values,
in [4.0...8.0) there are 2^53 different double values,
in [8.0...10.0) there are 2/8 * 2^53 different double values.
Yet if code prints in decimal with N significant digits, the number of combinations in [1.0...10.0) is 9/10 * 10^N.
Whatever N (precision) is chosen, there will not be a one-to-one mapping between double and decimal text. If a fixed N is chosen, it will sometimes be slightly more or less than truly needed for certain double values. We could err on too few (a) below) or too many (b) below).
3 candidate N:
a) Use an N so when converting from text-double-text we arrive at the same text for all double.
std::cout << dbl::digits10 << '\n';
// Typical output
15
b) Use an N so when converting from double-text-double we arrive at the same double for all double.
// C++11
std::cout << dbl::max_digits10 << '\n';
// Typical output
17
When max_digits10 is not available, note that due to the base 2 and base 10 attributes, digits10 + 2 <= max_digits10 <= digits10 + 3, so digits10 + 3 can be used to ensure enough decimal digits are printed (see the fallback sketch after the code below).
c) Use an N that varies with the value.
This can be useful when code wants to display minimal text (N == 1) or the exact value of a double (N == 1000-ish in the case of denorm_min). Yet since this is "work" and not likely OP's goal, it will be set aside.
It is usually b) that is used to "print a double value with full precision". Some applications may prefer a), to err on the side of not providing too much information.
With .scientific, .precision() sets the number of digits to print after the decimal point, so 1 + .precision() digits are printed. Since the code needs max_digits10 total digits, .precision() is called with max_digits10 - 1.
typedef std::numeric_limits< double > dbl;
std::cout.precision(dbl::max_digits10 - 1);
std::cout << std::scientific << exp (-100) << '\n';
std::cout << std::scientific << exp (+100) << '\n';
// Typical output
3.7200759760208361e-44
2.6881171418161356e+43
//2345678901234567 17 total digits
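For completeness, a sketch of the digits10 + 3 fallback mentioned in b), for when max_digits10 is not available (my illustration, not from the original answer):
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    // digits10 + 3 total digits; with std::scientific, precision counts
    // only the digits after the decimal point, hence the - 1.
    std::cout.precision(std::numeric_limits<double>::digits10 + 3 - 1);
    std::cout << std::scientific << std::exp(-100.0) << '\n';
}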
Similar C question
In C++20 you'll be able to use std::format to do this:
std::cout << std::format("{}", M_PI);
Output (assuming IEEE754 double):
3.141592653589793
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to the setprecision I/O manipulator is that it doesn't print unnecessary digits.
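A small sketch of that difference (assumes C++20 <format> and IEEE 754 doubles):
#include <format>
#include <iomanip>
#include <iostream>

int main()
{
    double d = 0.1;
    std::cout << std::setprecision(17) << d << '\n';  // 0.10000000000000001
    std::cout << std::format("{}", d) << '\n';        // 0.1 (shortest round-trip form)
}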
In the meantime you can use the {fmt} library that std::format is based on. {fmt} also provides the print function, which makes this even easier and more efficient (godbolt):
fmt::print("{}", M_PI);
Disclaimer: I'm the author of {fmt} and C++20 std::format.
The iostreams way is kind of clunky. I prefer using boost::lexical_cast because it calculates the right precision for me. And it's fast, too.
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
using boost::lexical_cast;
using std::string;

double d = 3.14159265358979;
std::cout << "Pi: " << lexical_cast<string>(d) << std::endl;
Output:
Pi: 3.14159265358979
Here is how to display a double with full precision:
double d = 100.0000000000005;
int precision = std::numeric_limits<double>::max_digits10;
std::cout << std::setprecision(precision) << d << std::endl;
This displays:
100.0000000000005
max_digits10 is the number of digits that are necessary to uniquely represent all distinct double values. max_digits10 represents the number of digits before and after the decimal point.
Don't use setprecision(max_digits10) with std::fixed: in fixed notation, setprecision() sets the number of digits after the decimal point only, whereas max_digits10 counts the digits both before and after the decimal point.
double d = 100.0000000000005;
int precision = std::numeric_limits<double>::max_digits10;
std::cout << std::fixed << std::setprecision(precision) << d << std::endl;
This displays an incorrect result:
100.00000000000049738
Note: Header files required
#include <iomanip>
#include <limits>
By full precision, I assume you mean enough precision to show the best approximation to the intended value, but it should be pointed out that double is stored using a base 2 representation, and base 2 can't represent something as trivial as 1.1 exactly. The only way to get the full, exact value of the actual double (with no round-off error) is to print out the binary bits (or hex nybbles).
One way of doing that is using a union to type-pun the double to an integer and then printing the integer, since integers do not suffer from truncation or round-off issues. (Type punning through a union like this is not supported by the C++ standard, but it is supported in C. However, most C++ compilers will probably print out the correct value anyway; g++, for example, supports it.)
#include <cstdint>
#include <iostream>

union {
    double d;
    uint64_t u64;
} x;
x.d = 1.1;
std::cout << std::hex << x.u64;
This will give you the 100% accurate representation of the double... and be utterly unreadable, because humans can't read IEEE double format! Wikipedia has a good write-up on how to interpret the binary bits.
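If you want to stay within behavior that is defined in C++ as well, a std::memcpy sketch does the same job as the union (it assumes a 64-bit double):
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    double d = 1.1;
    std::uint64_t u;
    static_assert(sizeof d == sizeof u, "assumes double is 64 bits");
    std::memcpy(&u, &d, sizeof u);       // well-defined in both C and C++
    std::cout << std::hex << u << '\n';  // e.g. 3ff199999999999a
}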
In newer C++, you can do
std::cout << std::hexfloat << 1.1;
C++20 std::format
This great new C++ library feature has the advantage of not affecting the state of std::cout as std::setprecision does:
#include <format>
#include <iostream>

int main() {
    std::cout << std::format("{:.2} {:.3}\n", 3.1415, 3.1415);
}
Expected output:
3.1 3.14
As mentioned at https://stackoverflow.com/a/65329803/895245, if you don't pass the precision explicitly it prints the shortest decimal representation with a round-trip guarantee. TODO: understand in more detail how it compares to dbl::max_digits10 as shown at https://stackoverflow.com/a/554134/895245 with {:.{}}:
#include <format>
#include <iostream>
#include <limits>

int main() {
    typedef std::numeric_limits<double> dbl;
    std::cout << std::format("{:.{}}\n",
        3.1415926535897932384626433, dbl::max_digits10);
}
See also:
Set back default floating point print precision in C++ for how to restore the initial precision pre-C++20
std::string formatting like sprintf
https://en.cppreference.com/w/cpp/utility/format/formatter#Standard_format_specification
IEEE 754 floating point values are stored using base 2 representation. Any base 2 number can be represented as a decimal (base 10) to full precision. None of the proposed answers, however, do. They all truncate the decimal value.
This seems to be due to a misinterpretation of what std::numeric_limits<T>::max_digits10 represents:
The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T.
In other words: It's the (worst-case) number of digits required to output if you want to roundtrip from binary to decimal to binary, without losing any information. If you output at least max_digits10 decimals and reconstruct a floating point value, you are guaranteed to get the exact same binary representation you started with.
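A minimal sketch of that round trip (my illustration; assumes IEEE 754 doubles):
#include <cassert>
#include <limits>
#include <sstream>

int main()
{
    double original = 0.1;
    std::stringstream ss;
    ss.precision(std::numeric_limits<double>::max_digits10);
    ss << original;                // binary -> decimal text
    double restored;
    ss >> restored;                // decimal text -> binary
    assert(restored == original);  // exactly the value we started with
}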
What's important: max_digits10 in general neither yields the shortest decimal, nor is it sufficient to represent the full precision. I'm not aware of a constant in the C++ Standard Library that encodes the maximum number of decimal digits required to contain the full precision of a floating point value. I believe it's something like 767 for doubles¹. One way to output a floating point value with full precision would be to use a sufficiently large value for the precision, like so², and have the library strip any trailing zeros:
#include <iostream>
int main() {
    double d = 0.1;
    std::cout.precision(767);
    std::cout << "d = " << d << std::endl;
}
This produces the following output, which contains the full precision:
d = 0.1000000000000000055511151231257827021181583404541015625
Note that this has significantly more decimals than max_digits10 would suggest.
While that answers the question that was asked, a far more common goal would be to get the shortest decimal representation of any given floating point value, that retains all information. Again, I'm not aware of any way to instruct the Standard I/O library to output that value. Starting with C++17 the possibility to do that conversion has finally arrived in C++ in the form of std::to_chars. By default, it produces the shortest decimal representation of any given floating point value that retains the entire information.
Its interface is a bit clunky, and you'd probably want to wrap this up into a function template that returns something you can output to std::cout (like a std::string), e.g.
#include <charconv>
#include <array>
#include <string>
#include <system_error>
#include <iostream>
#include <cmath>
template<typename T>
std::string to_string(T value)
{
    // 24 characters is the longest decimal representation of any double value
    std::array<char, 24> buffer {};
    auto const res { std::to_chars(buffer.data(), buffer.data() + buffer.size(), value) };
    if (res.ec == std::errc {})
    {
        // Success
        return std::string(buffer.data(), res.ptr);
    }
    // Error
    return { "FAILED!" };
}

int main()
{
    auto value { 0.1f };
    std::cout << to_string(value) << std::endl;
    value = std::nextafter(value, INFINITY);
    std::cout << to_string(value) << std::endl;
    value = std::nextafter(value, INFINITY);
    std::cout << to_string(value) << std::endl;
}
This would print out (using Microsoft's C++ Standard Library):
0.1
0.10000001
0.10000002
¹ From Stephan T. Lavavej's CppCon 2019 talk titled Floating-Point <charconv>: Making Your Code 10x Faster With C++17's Final Boss. (The entire talk is worth watching.)
² This would also require using a combination of scientific and fixed, whichever is shorter. I'm not aware of a way to set this mode using the C++ Standard I/O library.
printf("%.12f", M_PI);
%.12f means a floating-point value printed with 12 digits after the decimal point.
The best option is to use std::setprecision, and the solution works like this:
#include <iostream>
#include <iomanip>

int main()
{
    double a = 34.34322;
    std::cout << std::fixed << std::setprecision(5) << a << std::endl;
    return 0;
}
Note: you do not need to use cout.precision for this; std::setprecision must be given the desired number of digits as its argument (5 here), and it has to come before the value it is meant to affect.
Most portably...
#include <limits>
using std::numeric_limits;
...
cout.precision(numeric_limits<double>::digits10 + 1);
cout << d;
In this question there is a description of how to convert a double to a string losslessly (in Octave, but it can easily be reproduced in C++). The idea is to have a short human-readable description of the float and a lossless description in hexadecimal form, for instance: pi -> 3.14{54442d18400921fb}.
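A minimal sketch of the same idea directly in C++, using hexfloat for the lossless part (assumes C++11):
#include <iostream>

int main()
{
    double pi = 3.141592653589793;
    std::cout << std::defaultfloat << pi
              << " {" << std::hexfloat << pi << "}\n";
    // e.g. 3.14159 {0x1.921fb54442d18p+1}
}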
Here is a function that works for any floating-point type, not just double, and also puts the stream back the way it was found afterwards. Unfortunately it won't interact well with threads, but that's the nature of iostreams. You'll need these includes at the start of your file:
#include <limits>
#include <iostream>
Here's the function; you could put it in a header file if you use it a lot:
template <class T>
void printVal(std::ostream& os, T val)
{
    auto oldFlags = os.flags();
    auto oldPrecision = os.precision();

    os.flags(oldFlags & ~std::ios_base::floatfield);
    os.precision(std::numeric_limits<T>::digits10);
    os << val;

    os.flags(oldFlags);
    os.precision(oldPrecision);
}
Use it like this:
double d = foo();
float f = bar();
printVal(std::cout, d);
printVal(std::cout, f);
If you want to be able to use the normal insertion << operator, you can use this extra wrapper code:
template <class T>
struct PrintValWrapper { T val; };

template <class T>
std::ostream& operator<<(std::ostream& os, PrintValWrapper<T> pvw) {
    printVal(os, pvw.val);
    return os;
}

template <class T>
PrintValWrapper<T> printIt(T val) {
    return PrintValWrapper<T>{val};
}
Now you can use it like this:
double d = foo();
float f = bar();
std::cout << "The values are: " << printIt(d) << ", " << printIt(f) << '\n';
This will show the value with two decimal places after the dot.
#include <iostream>
#include <iomanip>
using namespace std;

double d = 2.0;
int n = 2;
cout << fixed << setprecision(n) << d;
See here: Fixed-point notation
std::fixed
Use fixed floating-point notation: sets the floatfield format flag for
the str stream to fixed.
When floatfield is set to fixed, floating-point values are written
using fixed-point notation: the value is represented with exactly as
many digits in the decimal part as specified by the precision field
(precision) and with no exponent part.
std::setprecision
Set decimal precision: sets the decimal precision to be used to format
floating-point values on output operations.
If you're familiar with the IEEE standard for representing floating-point values, you'll know that it is impossible to show floating-point values with full precision beyond the scope of the standard; the result will always be a rounding of the real value.
You need to first check whether the value is within scope; if it is, then use:
cout << defaultfloat << d ;
std::defaultfloat
Use default floating-point notation: sets the floatfield format flag
for the str stream to defaultfloat.
When floatfield is set to defaultfloat, floating-point values are
written using the default notation: the representation uses as many
meaningful digits as needed up to the stream's decimal precision
(precision), counting both the digits before and after the decimal
point (if any).
That is also the default behavior of cout, which means you don't need to use it explicitly.
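For instance, a small sketch of switching back to it after std::fixed (my illustration):
#include <iostream>

int main()
{
    double d = 123456.789;
    std::cout << std::fixed        << d << '\n';  // 123456.789000
    std::cout << std::defaultfloat << d << '\n';  // 123457 (the default again)
}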
With ostream::precision(int)
cout.precision( numeric_limits<double>::digits10 + 1);
cout << M_PI << ", " << M_E << endl;
will yield
3.141592653589793, 2.718281828459045
Why you have to say "+1" I have no clue, but the extra digit you get out of it is correct.

Why in C++ does static_cast<unsigned> of a negative number differ depending on whether the number is constant

What are the C++ rules that make equal false? Given:
float f {-1.0};
bool equal = (static_cast<unsigned>(f) == static_cast<unsigned>(-1.0));
E.g. https://godbolt.org/z/fcmx2P
#include <iostream>

int main()
{
    float f {-1.0};
    const float cf {-1.0};
    std::cout << std::hex;
    std::cout << " f" << "=" << static_cast<unsigned>(f) << '\n';
    std::cout << "cf" << "=" << static_cast<unsigned>(cf) << '\n';
    return 0;
}
Produces the following output:
f=ffffffff
cf=0
The behaviour of your program is undefined: the C++ standard does not define the conversion of a negative floating-point value to an unsigned type.
(Note that the familiar wrap-around behaviour applies only to negative integral types.)
There is therefore little point in attempting to explain your program's output.
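If the wrap-around result is what you actually want, one way to get it with defined behaviour is to go through a signed integer first (a sketch; it assumes the value fits in int):
#include <iostream>

int main()
{
    float f {-1.0};
    // float -> int is defined for in-range values; int -> unsigned wraps modulo 2^32.
    unsigned u = static_cast<unsigned>(static_cast<int>(f));
    std::cout << std::hex << u << '\n';  // ffffffff
}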

std::cout with floating-point number

I'm using Visual Studio 2015 to print two floating-point numbers:
double d1 = 1.5;
double d2 = 123456.789;
std::cout << "value1: " << d1 << std::endl;
std::cout << "value2: " << d2 << std::endl;
std::cout << "maximum number of significant decimal digits (value1): " << -std::log10(std::nextafter(d1, std::numeric_limits<double>::max()) - d1) << std::endl;
std::cout << "maximum number of significant decimal digits (value2): " << -std::log10(std::nextafter(d2, std::numeric_limits<double>::max()) - d2) << std::endl;
This prints the following:
value1: 1.5
value2: 123457
maximum number of significant decimal digits (value1): 15.6536
maximum number of significant decimal digits (value2): 10.8371
Why is 123457 printed for the value 123456.789? Does the ANSI C++ specification allow std::cout to display anything at all for floating-point numbers when used without std::setprecision()?
The rounding happens because of the default stream precision required by the C++ standard, which can be seen by writing
std::cout << std::cout.precision();
The output will be 6, which tells you that the default number of significant digits printed by std::cout is 6. That is why it automatically rounds the floating-point number to 6 significant digits.
What you have pointed out is actually one of those many things that the standardization committee should consider regarding standard iostreams in C++. Such things work well when you write:
printf("%f\n", d2);
but not with std::cout, where you need std::setprecision because its formatting is similar to %g rather than %f in printf. So you need to write:
std::cout << std::setprecision(10) << "value2: " << d2 << std::endl;
But if you don't like this method and are using C++11 (and onwards), then you can also write:
std::cout << "value2: " << std::to_string(d2) << std::endl;
This will give you the same result as printf("%f\n", d2);.
A much better method is to cancel the rounding that occurs in std::cout by using std::fixed :-
#include <iostream>
#include <iomanip>

int main()
{
    std::cout << std::fixed;
    double d = 123456.789;
    std::cout << d;
    return 0;
}
Output:
123456.789000
So I guess your problem is solved!
I think the problem here is that the C++ standard is not written to be easy to read; it is written to be precise and not repeat itself. So if you look up operator<<(double), it doesn't say anything other than that it uses num_put, because that is how cout << some_float_value is implemented.
The default behaviour is what printf("%g", value) does [table 88 in the n3337 version of the C++ standard explains the equivalence of printf and C++ formatting]. So if you want %.16g, you need to change the precision by calling setprecision(16).
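A small sketch of that equivalence (my illustration):
#include <cstdio>
#include <iomanip>
#include <iostream>

int main()
{
    double d2 = 123456.789;
    std::printf("%.16g\n", d2);                       // C-style
    std::cout << std::setprecision(16) << d2 << '\n'; // iostreams equivalent
}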

How to produce formatting similar to .NET's '0.###%' in iostreams?

I would like to output a floating-point number as a percentage, with up to three decimal places.
I know that iostreams have three different ways of presenting floats:
"default", which displays using either the rules of fixed or scientific, depending on the number of significant digits desired as defined by setprecision;
fixed, which displays a fixed number of decimal places defined by setprecision; and
scientific, which displays a fixed number of decimal places but using scientific notation, i.e. mantissa + exponent of the radix.
These three modes can be seen in effect with this code:
#include <iostream>
#include <iomanip>
int main() {
    double d = 0.00000095;
    double e = 0.95;

    std::cout << std::setprecision(3);

    std::cout.unsetf(std::ios::floatfield);
    std::cout << "d = " << (100. * d) << "%\n";
    std::cout << "e = " << (100. * e) << "%\n";

    std::cout << std::fixed;
    std::cout << "d = " << (100. * d) << "%\n";
    std::cout << "e = " << (100. * e) << "%\n";

    std::cout << std::scientific;
    std::cout << "d = " << (100. * d) << "%\n";
    std::cout << "e = " << (100. * e) << "%\n";
}
// output:
// d = 9.5e-05%
// e = 95%
// d = 0.000%
// e = 95.000%
// d = 9.500e-05%
// e = 9.500e+01%
None of these options satisfies me.
I would like to avoid any scientific notation here as it makes the percentages really hard to read. I want to keep at most three decimal places, and it's ok if very small values show up as zero. However, I would also like to avoid trailing zeros in fractional places for cases like 0.95 above: I want that to display as in the second line, as "95%".
In .NET, I can achieve this with a custom format string like "0.###%", which gives me a number formatted as a percentage with at least one digit left of the decimal separator, and up to three digits right of the decimal separator, trailing zeros skipped: http://ideone.com/uV3nDi
Can I achieve this with iostreams, without writing my own formatting logic (e.g. special casing small numbers)?
I'm reasonably certain nothing built into iostreams supports this directly.
I think the cleanest way to handle it is to round the number before passing it to an iostream to be printed out:
#include <iostream>
#include <vector>
#include <cmath>
double rounded(double in, int places) {
    double factor = std::pow(10, places);
    return std::round(in * factor) / factor;
}

int main() {
    std::vector<double> values{ 0.000000095123, 0.0095123, 0.95, 0.95123 };

    for (auto i : values)
        std::cout << "value = " << 100. * rounded(i, 5) << "%\n";
}
Due to the way it does rounding, this has a limitation on the magnitude of numbers it can work with. For percentages this probably isn't an issue, but if you were working with a number close to the largest that can be represented in the type in question (double in this case) the multiplication by pow(10, places) could/would overflow and produce bad results.
Though I can't be absolutely certain, it doesn't seem like this would be likely to cause an issue for the problem you seem to be trying to solve.
This solution is terrible.
I am serious. I don't like it. It's probably slow and the function has a stupid name. Maybe you can use it for test verification, though, because it's so dumb I guess you can easily see it pretty much has to work.
It also assumes the decimal separator is '.', which doesn't have to be the case. The proper character could be obtained by:
char point = std::use_facet< std::numpunct<char> >(std::cout.getloc()).decimal_point();
But that's still not solving the problem, because the characters used for digits could be different and in general this isn't something that should be written in such a way.
Here it is.
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>

template<typename Floating>
std::string formatFloatingUpToN(unsigned n, Floating f) {
    std::stringstream out;
    out << std::setprecision(n) << std::fixed;
    out << f;
    std::string ret = out.str();

    // if this clause holds, it's all zeroes
    if (std::abs(f) < std::pow(0.1, n))
        return ret;

    while (true) {
        if (ret.back() == '0') {
            ret.pop_back();
            continue;
        } else if (ret.back() == '.') {
            ret.pop_back();
            break;
        } else
            break;
    }
    return ret;
}
And here it is in action.

Floating point limits code not producing correct results

I am racking my brain trying to figure out why this code does not get the right result. I am looking for the hexadecimal representations of the floating-point positive and negative overflow/underflow levels. The code is based on this site and a Wikipedia entry:
7f7f ffff ≈ 3.4028234 × 10^38 (max single precision) -- from the Wikipedia entry; corresponds to positive overflow
Here's the code:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
using namespace std;

int main(void) {
    float two = 2;
    float twentyThree = 23;
    float one27 = 127;
    float one49 = 149;
    float posOverflow, negOverflow, posUnderflow, negUnderflow;

    posOverflow = two - (pow(two, -twentyThree) * pow(two, one27));
    negOverflow = -(two - (pow(two, one27) * pow(two, one27)));

    negUnderflow = -pow(two, -one49);
    posUnderflow = pow(two, -one49);

    cout << "Positive overflow occurs when value greater than: " << hex << *(int*)&posOverflow << endl;
    cout << "Neg overflow occurs when value less than: " << hex << *(int*)&negOverflow << endl;
    cout << "Positive underflow occurs when value greater than: " << hex << *(int*)&posUnderflow << endl;
    cout << "Neg overflow occurs when value greater than: " << hex << *(int*)&negUnderflow << endl;
}
The output is:
Positive overflow occurs when value greater than: f3800000
Neg overflow occurs when value less than: 7f800000
Positive underflow occurs when value greater than: 1
Neg overflow occurs when value greater than: 80000001
To get the hexadecimal representation of the floating point, I am using a method described here:
Why isn't the code working? I know it'll work if positive overflow = 7f7f ffff.
Your expression for the highest representable positive float is wrong. The page you linked uses (2-pow(2, -23)) * pow(2, 127), and you have 2 - (pow(2, -23) * pow(2, 127)). Similarly for the smallest representable negative float.
Your underflow expressions look correct, however, and so do the hexadecimal outputs for them.
Note that posOverflow and negOverflow are simply +FLT_MAX and -FLT_MAX. But note that your posUnderflow and negUnderflow are actually smaller than FLT_MIN (because they are denormal, and FLT_MIN is the smallest positive normal float).
Floating point loses precision as the number gets bigger. A number of magnitude 2^127 does not change when you add 2 to it.
Other than that, I'm not really following your code. Using words to spell out numbers makes it hard for me to read.
Here is the standard way to get the floating-point limits of your machine:
#include <limits>
#include <iostream>
#include <iomanip>
std::ostream &show_float( std::ostream &s, float f ) {
    s << f << " = ";

    std::ostream s_hex( s.rdbuf() );
    s_hex << std::hex << std::setfill( '0' );

    for ( char const *c = reinterpret_cast< char const * >( & f );
          c != reinterpret_cast< char const * >( & f + 1 );
          ++ c ) {
        s_hex << std::setw( 2 ) << ( static_cast< unsigned int >( * c ) & 0xff );
    }

    return s;
}

int main() {
    std::cout << std::hex;

    std::cout << "Positive overflow occurs when value greater than: ";
    show_float( std::cout, std::numeric_limits< float >::max() ) << '\n';

    std::cout << "Neg overflow occurs when value less than: ";
    show_float( std::cout, - std::numeric_limits< float >::max() ) << '\n';

    std::cout << "Positive underflow occurs when value less than: ";
    show_float( std::cout, std::numeric_limits< float >::min() ) << '\n';

    std::cout << "Neg underflow occurs when value greater than: ";
    show_float( std::cout, - std::numeric_limits< float >::min() ) << '\n';
}
output:
Positive overflow occurs when value greater than: 3.40282e+38 = ffff7f7f
Neg overflow occurs when value less than: -3.40282e+38 = ffff7fff
Positive underflow occurs when value less than: 1.17549e-38 = 00008000
Neg underflow occurs when value greater than: -1.17549e-38 = 00008080
The output depends on the endianness of the machine. Here the bytes are reversed due to little-endian order.
Note, "underflow" in this case isn't a catastrophic zero result, but just denormalization which gradually reduces precision. (It may be catastrophic to performance, though.) You might also check numeric_limits< float >::denorm_min() which produces 1.4013e-45 = 01000000.
Your code assumes integers have the same size as a float (as do all but a few of the posts on the page you've linked, btw). You probably want something along the lines of:
for (size_t s = 0; s < sizeof(myVar); ++s) {
    unsigned char byte = reinterpret_cast<unsigned char*>(&myVar)[s];
    // the s-th byte is byte
}
that is, something akin to the templated code on that page.
Your compiler may not be using those specific IEEE 754 types. You'll need to check its documentation.
Also, consider using std::numeric_limits<float>::min()/max() or the cfloat FLT_ constants for determining some of those values.