I am racking my brain trying to figure out why this code does not get the right result. I am looking for the hexadecimal representations of the floating point positive and negative overflow/underflow levels. The code is based off this site and a Wikipedia entry:
7f7f ffff ≈ 3.4028234 × 10^38 (max single precision) -- from the Wikipedia entry; corresponds to positive overflow
Here's the code:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
using namespace std;
int main(void) {
float two = 2;
float twentyThree = 23;
float one27 = 127;
float one49 = 149;
float posOverflow, negOverflow, posUnderflow, negUnderflow;
posOverflow = two - (pow(two, -twentyThree) * pow(two, one27));
negOverflow = -(two - (pow(two, one27) * pow(two, one27)));
negUnderflow = -pow(two, -one49);
posUnderflow = pow(two, -one49);
cout << "Positive overflow occurs when value greater than: " << hex << *(int*)&posOverflow << endl;
cout << "Neg overflow occurs when value less than: " << hex << *(int*)&negOverflow << endl;
cout << "Positive underflow occurs when value greater than: " << hex << *(int*)&posUnderflow << endl;
cout << "Neg overflow occurs when value greater than: " << hex << *(int*)&negUnderflow << endl;
}
The output is:
Positive overflow occurs when value greater than: f3800000
Neg overflow occurs when value less than: 7f800000
Positive underflow occurs when value greater than: 1
Neg overflow occurs when value greater than: 80000001
To get the hexadecimal representation of the floating point, I am using a method described here:
Why isn't the code working? I know it'll work if positive overflow = 7f7f ffff.
Your expression for the highest representable positive float is wrong. The page you linked uses (2-pow(2, -23)) * pow(2, 127), and you have 2 - (pow(2, -23) * pow(2, 127)). Similarly for the smallest representable negative float.
Your underflow expressions look correct, however, and so do the hexadecimal outputs for them.
Note that posOverflow and negOverflow are simply +FLT_MAX and -FLT_MAX. But note that your posUnderflow and negUnderflow are actually smaller in magnitude than FLT_MIN (because they are denormal, and FLT_MIN is the smallest positive normal float).
Floating point loses precision as the number gets bigger. A number of the magnitude 2^127 does not change when you add 2 to it.
Other than that, I'm not really following your code. Using words to spell out numbers makes it hard for me to read.
Here is the standard way to get the floating-point limits of your machine:
#include <limits>
#include <iostream>
#include <iomanip>
std::ostream &show_float( std::ostream &s, float f ) {
s << f << " = ";
std::ostream s_hex( s.rdbuf() );
s_hex << std::hex << std::setfill( '0' );
for ( char const *c = reinterpret_cast< char const * >( & f );
c != reinterpret_cast< char const * >( & f + 1 );
++ c ) {
s_hex << std::setw( 2 ) << ( static_cast< unsigned int >( * c ) & 0xff );
}
return s;
}
int main() {
std::cout << std::hex;
std::cout << "Positive overflow occurs when value greater than: ";
show_float( std::cout, std::numeric_limits< float >::max() ) << '\n';
std::cout << "Neg overflow occurs when value less than: ";
show_float( std::cout, - std::numeric_limits< float >::max() ) << '\n';
std::cout << "Positive underflow occurs when value less than: ";
show_float( std::cout, std::numeric_limits< float >::min() ) << '\n';
std::cout << "Neg underflow occurs when value greater than: ";
show_float( std::cout, - std::numeric_limits< float >::min() ) << '\n';
}
output:
Positive overflow occurs when value greater than: 3.40282e+38 = ffff7f7f
Neg overflow occurs when value less than: -3.40282e+38 = ffff7fff
Positive underflow occurs when value less than: 1.17549e-38 = 00008000
Neg underflow occurs when value greater than: -1.17549e-38 = 00008080
The output depends on the endianness of the machine. Here the bytes are reversed due to little-endian order.
Note, "underflow" in this case isn't a catastrophic zero result, but just denormalization which gradually reduces precision. (It may be catastrophic to performance, though.) You might also check numeric_limits< float >::denorm_min() which produces 1.4013e-45 = 01000000.
Your code assumes integers have the same size as a float (so do all but a few of the posts on the page you've linked, btw.) You probably want something along the lines of:
for (size_t s = 0; s < sizeof(myVar); ++s) {
unsigned char byte = reinterpret_cast<unsigned char*>(&myVar)[s];
// byte now holds the s-th byte of myVar
}
that is, something akin to the templated code on that page.
Your compiler may not be using those specific IEEE 754 types. You'll need to check its documentation.
Also, consider using std::numeric_limits<float>::min()/max() or the cfloat FLT_ constants for determining some of those values.
#include <iostream>
#include <iomanip>
#include <vector>
using namespace std;
int(main){
std::vector<int> vObj;
float n = 0.59392;
int nCopy = n;
int temNum = 0;;
while (fmod(nCopy, 1) != 0) {
temNum = (nCopy * 10); cout << endl << nCopy << endl;
nCopy *= 10;
vObj.push_back(temNum);
cout << "\n\n Cycle\n\n";
cout << "Temp Num: " << temNum << "\n\nN: " << nCopy << endl;
}
return 0;
}
For example, I input 0.59392. When the loop reaches 5939.2, it should go on to 59392 and stop, but for some reason it keeps going.
Yes, you have three major problems in your code. First, it's int main(), not int(main). Second, the variable nCopy should not be an integer type. Third, you need to know how a floating-point number is actually represented. But first, here is my solution to your problem. It's not the best one, but it works for this case:
#include <iostream>
#include <iomanip>
#include <vector>
#include <cmath> // fmod
using namespace std;
int main() {
std::vector<int> vObj;
double n = 0.59392;
double nCopy = n;
int temNum = 0;;
while (fmod(nCopy, 1) != 0) {
temNum = (nCopy * 10); cout << endl << nCopy << endl;
nCopy *= 10;
vObj.push_back(temNum);
cout << "\n\n Cycle\n\n";
cout << "Temp Num: " << temNum << "\n\nN: " << nCopy << endl;
}
return 0;
}
The explanation is as follows. First, double gives higher precision than float, which is why I used double instead of float, but it too loses accuracy as the number gets bigger.
Second, you need to know how a float or double is represented: according to the IEEE 754 standard, the value 0.59392 is actually stored in memory as approximately 0.5939199924 when using float. There are other kinds of number types that address this problem; the differences between them are as follows:
Decimal representations give lower accuracy but a higher range for big numbers, and high accuracy for small numbers. The two most widely used encodings are binary integer decimal (BID) and densely packed decimal (DPD).
float and double give higher accuracy than decimal types for big numbers but a lower range. They follow the IEEE 754 standard.
Fixed-point types have the lowest range, but they are the most accurate and the fastest.
Unfortunately, standard C++ only provides binary floating-point types (float, double and long double), but I believe there are external libraries that define a decimal data type.
Consider the following code for integral types:
#include <bitset>
#include <iostream>
#include <string>
template <class T>
std::string as_binary_string( T value ) {
return std::bitset<sizeof( T ) * 8>( value ).to_string();
}
int main() {
unsigned char a(2);
char b(4);
unsigned short c(2);
short d(4);
unsigned int e(2);
int f(4);
unsigned long long g(2);
long long h(4);
std::cout << "a = " << +a << " " << as_binary_string( a ) << std::endl;
std::cout << "b = " << +b << " " << as_binary_string( b ) << std::endl;
std::cout << "c = " << c << " " << as_binary_string( c ) << std::endl;
std::cout << "d = " << d << " " << as_binary_string( d ) << std::endl;
std::cout << "e = " << e << " " << as_binary_string( e ) << std::endl;
std::cout << "f = " << f << " " << as_binary_string( f ) << std::endl;
std::cout << "g = " << g << " " << as_binary_string( g ) << std::endl;
std::cout << "h = " << h << " " << as_binary_string( h ) << std::endl;
std::cout << "\nPress any key and enter to quit.\n";
char q;
std::cin >> q;
return 0;
}
Pretty straightforward, works well and is quite simple.
EDIT
How would one go about writing a function to extract the binary or bit pattern of arbitrary floating point types at compile time?
When it comes to floats, I have not found anything similar in any existing library that I know of. I searched Google for days looking for one, and then resorted to trying to write my own function, without success. I no longer have the attempted code, so I cannot show the different implementations I tried along with their compiler errors. I was interested in generating the bit pattern for floats in a generic way at compile time, and wanted to integrate that into my existing class, which already does the same seamlessly for any integral type. As for the floating-point types themselves, I have taken into consideration the different formats as well as architecture endianness. For my purposes, the standard IEEE versions of the floating-point types are all I need to be concerned with.
iBug suggested that I write my own function when I originally asked this question, while I was already attempting to do so. I understand binary numbers, memory sizes, and the mathematics, but putting it all together with how floating-point types are stored in memory, with their different parts (sign bit, base and exponent), is where I was having the most trouble.
Since then, thanks to the suggestions and examples from those who gave great answers, I was able to write a function that fits nicely into my existing class template, and it now works for my intended purposes.
What about writing one by yourself?
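#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");
static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits wide");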
std::string as_binary_string( float value ) {
std::uint32_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(float) * 8>(t).to_string();
}
std::string as_binary_string( double value ) {
std::uint64_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(double) * 8>(t).to_string();
}
You may need to change the helper variable t in case the sizes for the floating point numbers are different.
Alternatively, you can copy them bit by bit. This is slower but works for any type.
template <typename T>
std::string as_binary_string( T value )
{
const std::size_t nbytes = sizeof(T), nbits = nbytes * CHAR_BIT;
std::bitset<nbits> b;
std::uint8_t buf[nbytes];
std::memcpy(buf, &value, nbytes);
for(int i = 0; i < nbytes; ++i)
{
std::uint8_t cur = buf[i];
int offset = i * CHAR_BIT;
for(int bit = 0; bit < CHAR_BIT; ++bit)
{
b[offset] = cur & 1;
++offset; // Move to next bit in b
cur >>= 1; // Move to next bit in array
}
}
return b.to_string();
}
You said it doesn't need to be standard. So, here is what works in clang on my computer:
#include <iostream>
#include <algorithm>
using namespace std;
int main()
{
char *result;
result=new char[33](); // value-initialize so that the 33rd byte is '\0'
fill(result,result+32,'0');
float input;
cin >>input;
asm(
"mov %0,%%eax\n"
"mov %1,%%rbx\n"
".intel_syntax\n"
"mov rcx,20h\n"
"loop_begin:\n"
"shr eax\n"
"jnc loop_end\n"
"inc byte ptr [rbx+rcx-1]\n"
"loop_end:\n"
"loop loop_begin\n"
".att_syntax\n"
:
: "m" (input), "m" (result)
);
cout <<result <<endl;
delete[] result;
return 0;
}
This code makes a bunch of assumptions about the computer architecture and I am not sure on how many computers it would work.
EDIT:
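My computer is a 64-bit MacBook Air. This program basically works by allocating a 33-byte string and filling the first 32 bytes with '0'. The 33rd byte has to be the terminating '\0'; note that a plain new char[33] leaves it uninitialized, so it should be value-initialized or set explicitly.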
Then it uses inline assembly to store the float into a 32-bit register and then it repeatedly shifts it to the right by one bit.
If the last bit in the register was 1 before the shift, it gets stored into the carry flag.
The assembly code then checks the carry flag and, if it contains 1, it increases the corresponding byte in the string by 1.
Since it was previously initialized to '0', it will turn to '1'.
So, effectively, when the loop in the assembly is finished, the binary representation of a float is stored into a string.
This code only works for x64 (it uses 64-bit registers "rbx" and "rcx" to store the pointer and the counter for the loop), but I think it's easy to tweak it to work on other processors.
An IEEE 754 double-precision floating point number looks like the following
sign     exponent   mantissa
1 bit    11 bits    52 bits
Note that there's a hidden 1 before the mantissa, and the exponent
is biased so 1023 = 0, not two's complement.
By memcpy()ing to a 64 bit unsigned integer you can then apply AND and
OR masks to get the bit pattern. The arrangement could be big endian
or little endian.
You can easily work out which arrangement you have by passing easy numbers
such as 1 or 2.
Generally people either use std::hexfloat or cast a pointer to the floating-point value to a pointer to an unsigned integer of the same size and print the indirected value in hex format. Both methods facilitate bit-level analysis of floating-point in a productive fashion.
You could roll your own by casting the address of the float/double to a char pointer and iterating over the bytes that way:
#include <memory>
#include <string>
#include <iostream>
#include <limits>
#include <iomanip>
template <typename T>
std::string getBits(T t) {
std::string returnString{""};
char *base{reinterpret_cast<char *>(std::addressof(t))};
char *tail{base + sizeof(t) - 1};
do {
for (int bits = std::numeric_limits<unsigned char>::digits - 1; bits >= 0; bits--) {
returnString += ( ((*tail) & (1 << bits)) ? '1' : '0');
}
} while (--tail >= base);
return returnString;
}
int main() {
float f{10.0};
double d{100.0};
double nd{-100.0};
std::cout << std::setprecision(1);
std::cout << getBits(f) << std::endl;
std::cout << getBits(d) << std::endl;
std::cout << getBits(nd) << std::endl;
}
Output on my machine (note the sign flip in the third output):
01000001001000000000000000000000
0100000001011001000000000000000000000000000000000000000000000000
1100000001011001000000000000000000000000000000000000000000000000
This code snippet in Visual Studio 2013:
double a = 0.0;
double b = -0.0;
cout << (a == b) << " " << a << " " << b;
prints 1 0 -0. What is the difference between a and b?
C++ does not guarantee to differentiate between +0 and -0. This is a feature of each particular number representation. The IEEE 754 standard for floating point arithmetic does make this distinction, which can be used to keep sign information even when numbers go to zero. std::numeric_limits does not directly tell you if you have possible signed zeroes. But if std::numeric_limits<double>::is_iec559 is true then you can in practice assume that you have IEEE 754 representation, and thus possibly negative zero.
Noted by “gmch” in a comment, the C++11 standard library way to check the sign of a zero is to use std::copysign, or more directly using std::signbit, e.g. as follows:
#include <iostream>
#include <math.h> // copysign, signbit
using namespace std;
auto main() -> int
{
double const z1 = +0.0;
double const z2 = -0.0;
cout << boolalpha;
cout << "z1 is " << (signbit( z1 )? "negative" : "positive") << "." << endl;
cout << "z2 is " << (signbit( z2 )? "negative" : "positive") << "." << endl;
}
Without copysign or signbit, e.g. for a C++03 compiler, one way to detect a negative zero z is to check whether 1.0/z is negative infinity, e.g. by checking if it's just negative.
#include <iostream>
using namespace std;
auto main() -> int
{
double const z1 = +0.0;
double const z2 = -0.0;
cout << boolalpha;
cout << "z1 is " << (1/z1 < 0? "negative" : "positive") << "." << endl;
cout << "z2 is " << (1/z2 < 0? "negative" : "positive") << "." << endl;
}
But while this will probably work in practice on most any implementation, it's formally *Undefined Behavior.
One needs to be sure that the expression evaluation will not trap.
*) C++11 §5.6/4 “If the second operand of / or % is zero the behavior is undefined”
See http://en.m.wikipedia.org/wiki/Signed_zero
In a nutshell, it is due to the sign being stored as a stand-alone bit in the IEEE 754 floating point representation. This makes it possible to have zero exponent and fraction fields but still have the sign bit set, which gives a negative zero. This cannot happen for signed integers, which are stored in two's complement.
I would like to output a floating-point number as a percentage, with up to three decimal places.
I know that iostreams have three different ways of presenting floats:
"default", which displays using either the rules of fixed or scientific, depending on the number of significant digits desired as defined by setprecision;
fixed, which displays a fixed number of decimal places defined by setprecision; and
scientific, which displays a fixed number of decimal places but using scientific notation, i.e. mantissa + exponent of the radix.
These three modes can be seen in effect with this code:
#include <iostream>
#include <iomanip>
int main() {
double d = 0.00000095;
double e = 0.95;
std::cout << std::setprecision(3);
std::cout.unsetf(std::ios::floatfield);
std::cout << "d = " << (100. * d) << "%\n";
std::cout << "e = " << (100. * e) << "%\n";
std::cout << std::fixed;
std::cout << "d = " << (100. * d) << "%\n";
std::cout << "e = " << (100. * e) << "%\n";
std::cout << std::scientific;
std::cout << "d = " << (100. * d) << "%\n";
std::cout << "e = " << (100. * e) << "%\n";
}
// output:
// d = 9.5e-05%
// e = 95%
// d = 0.000%
// e = 95.000%
// d = 9.500e-05%
// e = 9.500e+01%
None of these options satisfies me.
I would like to avoid any scientific notation here as it makes the percentages really hard to read. I want to keep at most three decimal places, and it's ok if very small values show up as zero. However, I would also like to avoid trailing zeros in fractional places for cases like 0.95 above: I want that to display as in the second line, as "95%".
In .NET, I can achieve this with a custom format string like "0.###%", which gives me a number formatted as a percentage with at least one digit left of the decimal separator, and up to three digits right of the decimal separator, trailing zeros skipped: http://ideone.com/uV3nDi
Can I achieve this with iostreams, without writing my own formatting logic (e.g. special casing small numbers)?
I'm reasonably certain nothing built into iostreams supports this directly.
I think the cleanest way to handle it is to round the number before passing it to an iostream to be printed out:
#include <iostream>
#include <vector>
#include <cmath>
double rounded(double in, int places) {
double factor = std::pow(10, places);
return std::round(in * factor) / factor;
}
int main() {
std::vector<double> values{ 0.000000095123, 0.0095123, 0.95, 0.95123 };
for (auto i : values)
std::cout << "value = " << 100. * rounded(i, 5) << "%\n";
}
Due to the way it does rounding, this has a limitation on the magnitude of numbers it can work with. For percentages this probably isn't an issue, but if you were working with a number close to the largest that can be represented in the type in question (double in this case) the multiplication by pow(10, places) could/would overflow and produce bad results.
Though I can't be absolutely certain, it doesn't seem like this would be likely to cause an issue for the problem you seem to be trying to solve.
This solution is terrible.
I am serious. I don't like it. It's probably slow and the function has a stupid name. Maybe you can use it for test verification, though, because it's so dumb I guess you can easily see it pretty much has to work.
It also assumes decimal separator to be '.', which doesn't have to be the case. The proper point could be obtained by:
char point = std::use_facet< std::numpunct<char> >(std::cout.getloc()).decimal_point();
But that's still not solving the problem, because the characters used for digits could be different and in general this isn't something that should be written in such a way.
Here it is.
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>
template<typename Floating>
std::string formatFloatingUpToN(unsigned n, Floating f) {
std::stringstream out;
out << std::setprecision(n) << std::fixed;
out << f;
std::string ret = out.str();
// if this clause holds, it's all zeroes
if (std::abs(f) < std::pow(0.1, n))
return ret;
while (true) {
if (ret.back() == '0') {
ret.pop_back();
continue;
} else if (ret.back() == '.') {
ret.pop_back();
break;
} else
break;
}
return ret;
}
And here it is in action.
Out of nowhere I get quite a big result for this function... It should be very simple, but I can't see it now.
double prob_calculator_t::pimpl_t::B_full_term() const
{
double result = 0.0;
for (uint32_t j=0, j_end=U; j<j_end; j++)
{
uint32_t inhabited_columns = doc->row_sums[j];
// DEBUG
cout << "inhabited_columns: " << inhabited_columns << endl;
cout << "log_of_sum[j]: " << log_of_sum[j] << endl;
cout << "sum_of_log[j]: " << sum_of_log[j] << endl;
// end DEBUG
result += ( -inhabited_columns * log( log_of_sum[j] ) + sum_of_log[ j ] );
cout << "result: " << result << endl;
}
return result;
}
and here is the trace:
inhabited_columns: 1
log_of_sum[j]: 110.56
sum_of_log[j]: -2.81341
result: 2.02102e+10
inhabited_columns: 42
log_of_sum[j]: 110.56
sum_of_log[j]: -143.064
result: 4.04204e+10
Thanks for the help!
inhabited_columns is unsigned and I see a unary - just before it: -inhabited_columns.
(Note that unary - has a really high operator precedence; higher than * etc).
That is where your problem is! To quote Mike Seymour's answer:
When you negate it, the result is still unsigned; the value is reduced
modulo 2^32 to give a large positive value.
One fix would be to write
-(inhabited_columns * log(log_of_sum[j]))
as then the negation will be carried out in floating point.
inhabited_columns is an unsigned type. When you negate it, the result is still unsigned; the value is reduced modulo 2^32 to give a large positive value.
You should change it to a sufficiently large signed type (maybe int32_t, if you're not going to have more than a couple of billion columns), or perhaps double since you're about to use it in double-precision arithmetic.