how to handle floating point imprecision? [duplicate] - c++

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 2 years ago.
I am converting an array of bytes to a 32 bit floating point. Sometimes the numbers are slightly off.
Example:
10.1 becomes 10.100000381469727 when I serialize the value in RapidJSON. How can I normalize this?
I can't share that code. What I can share is this to prove it:
#include <cstdlib>   // atof
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

std::string BytesToHexString(
    unsigned char* data,
    size_t len
)
{
    std::stringstream ss;
    ss << std::hex << std::setfill('0');
    // Walk the bytes in reverse; the unsigned index wraps past zero to a
    // huge value, so `i < len` alone terminates the loop (`i >= 0` is
    // always true for a size_t).
    for (size_t i = len - 1; i < len; --i)
        ss << std::setw(2) << static_cast<int>(data[i]);
    return ss.str();
}
std::string FLOATvalueToHexString(
    float value
)
{
    // Type punning through a union is technically undefined behaviour in
    // C++ (unlike C), but the major compilers support it; memcpy is the
    // strictly portable alternative.
    union FloatToUChar {
        float f;
        unsigned char c[sizeof(float)];
    };
    FloatToUChar floatUnion;
    floatUnion.f = value;
    return BytesToHexString(
        floatUnion.c,
        sizeof(float)
    );
}
int main()
{
    std::string sFloatValue = "10.100000";
    float fltValue = atof(sFloatValue.c_str());
    std::string strHexFloatValue = FLOATvalueToHexString(fltValue);
    std::cout << sFloatValue << " " << fltValue << " " << strHexFloatValue << std::endl;
    return 0;
}
It prints: 10.100000 10.1 4121999a
The debugger says fltValue is 10.1000004.
If I convert 4121999a then this confirms that the internal storage is indeed off:
https://babbage.cs.qc.cuny.edu/IEEE-754.old/32bit.html
10.100000381469727
How can I normalize the floating point so I can at least get the correct hexadecimal value?

Just as an int can't store non-whole numbers, a float or double can only store a subset of the real numbers. 10.1 is not in that subset, so the closest representable value (10.100000381469727 for a 32-bit float) is stored instead.
If you want to be able to store 0.1 exactly, then use a decimal type. See C++ decimal data types for a starting point.
Job done!

Related

Showing binary representation of floating point types in C++ [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 5 years ago.
Consider the following code for integral types:
template <class T>
std::string as_binary_string( T value ) {
return std::bitset<sizeof( T ) * 8>( value ).to_string();
}
int main() {
unsigned char a(2);
char b(4);
unsigned short c(2);
short d(4);
unsigned int e(2);
int f(4);
unsigned long long g(2);
long long h(4);
std::cout << "a = " << +a << " " << as_binary_string( a ) << std::endl;
std::cout << "b = " << +b << " " << as_binary_string( b ) << std::endl;
std::cout << "c = " << c << " " << as_binary_string( c ) << std::endl;
std::cout << "d = " << c << " " << as_binary_string( d ) << std::endl;
std::cout << "e = " << e << " " << as_binary_string( e ) << std::endl;
std::cout << "f = " << f << " " << as_binary_string( f ) << std::endl;
std::cout << "g = " << g << " " << as_binary_string( g ) << std::endl;
std::cout << "h = " << h << " " << as_binary_string( h ) << std::endl;
std::cout << "\nPress any key and enter to quit.\n";
char q;
std::cin >> q;
return 0;
}
Pretty straightforward; it works well and is quite simple.
EDIT
How would one go about writing a function to extract the binary or bit pattern of arbitrary floating point types at compile time?
When it comes to floats, I have not found anything similar in any existing library I know of. I searched Google for days, then resorted to trying to write my own function, without success. I no longer have the attempted code from when I originally asked this question, so I cannot show the different implementation attempts along with their compiler errors. I was interested in generating the bit pattern for floats in a generic way at compile time, and in integrating that into my existing class, which already does the same for any integral type. As for the floating types themselves, I have taken the different formats and architecture endianness into consideration; for my general purposes the standard IEEE formats are all I should need to be concerned with.
iBug suggested that I write my own function when I originally asked this question, which I was already attempting. I understand binary numbers, memory sizes, and the mathematics, but putting it all together with how floating point types are stored in memory, with their different parts (sign bit, exponent, and mantissa), is where I had the most trouble.
Since then, with the suggestions of those who gave a great answer and example, I was able to write a function that fits nicely into my existing class template, and it now works for my intended purposes.
What about writing one yourself?
#include <bitset>
#include <cstdint>
#include <cstring>   // std::memcpy
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t), "unexpected float size");
static_assert(sizeof(double) == sizeof(std::uint64_t), "unexpected double size");

std::string as_binary_string( float value ) {
    std::uint32_t t;
    std::memcpy(&t, &value, sizeof(value));
    return std::bitset<sizeof(float) * 8>(t).to_string();
}

std::string as_binary_string( double value ) {
    std::uint64_t t;
    std::memcpy(&t, &value, sizeof(value));
    return std::bitset<sizeof(double) * 8>(t).to_string();
}
You may need to change the helper variable t in case the sizes for the floating point numbers are different.
You can alternatively copy the value bit by bit. This is slower, but works for a type of any size.
#include <bitset>
#include <climits>   // CHAR_BIT
#include <cstdint>
#include <cstring>
#include <string>

template <typename T>
std::string as_binary_string( T value )
{
    constexpr std::size_t nbytes = sizeof(T), nbits = nbytes * CHAR_BIT;
    std::bitset<nbits> b;
    std::uint8_t buf[nbytes];
    std::memcpy(buf, &value, nbytes);
    for (std::size_t i = 0; i < nbytes; ++i)
    {
        std::uint8_t cur = buf[i];
        std::size_t offset = i * CHAR_BIT;
        for (int bit = 0; bit < CHAR_BIT; ++bit)
        {
            b[offset] = cur & 1;
            ++offset;   // Move to the next bit in b
            cur >>= 1;  // Move to the next bit in the byte
        }
    }
    return b.to_string();
}
You said it doesn't need to be standard. So, here is what works in clang on my computer:
#include <iostream>
#include <algorithm>
using namespace std;

int main()
{
    char *result;
    result = new char[33];
    fill(result, result + 32, '0');
    float input;
    cin >> input;
    asm(
        "mov %0,%%eax\n"
        "mov %1,%%rbx\n"
        ".intel_syntax\n"
        "mov rcx,20h\n"
        "loop_begin:\n"
        "shr eax\n"
        "jnc loop_end\n"
        "inc byte ptr [rbx+rcx-1]\n"
        "loop_end:\n"
        "loop loop_begin\n"
        ".att_syntax\n"
        :
        : "m" (input), "m" (result)
        : "eax", "rbx", "rcx", "cc", "memory" // tell the compiler what we clobber
    );
    cout << result << endl;
    delete[] result;
    return 0;
}
This code makes a bunch of assumptions about the computer architecture and I am not sure on how many computers it would work.
EDIT:
My computer is a 64-bit Mac-Air. This program basically works by allocating a 33-byte string and filling the first 32 bytes with '0' (the 33rd byte will automatically be '\0').
Then it uses inline assembly to store the float into a 32-bit register and then it repeatedly shifts it to the right by one bit.
If the last bit in the register was 1 before the shift, it gets stored into the carry flag.
The assembly code then checks the carry flag and, if it contains 1, it increases the corresponding byte in the string by 1.
Since it was previously initialized to '0', it will turn to '1'.
So, effectively, when the loop in the assembly is finished, the binary representation of a float is stored into a string.
This code only works for x64 (it uses 64-bit registers "rbx" and "rcx" to store the pointer and the counter for the loop), but I think it's easy to tweak it to work on other processors.
An IEEE double-precision number is laid out like this:

sign    exponent    mantissa
1 bit   11 bits     52 bits

Note that there is a hidden leading 1 before the mantissa for normalized values, and the exponent is biased: a stored value of 1023 represents an exponent of 0 (it is not two's complement).
By memcpy()ing the double into a 64-bit unsigned integer you can then apply AND and OR masks to pick out the bit fields. The byte arrangement could be big-endian or little-endian; you can easily work out which arrangement you have by passing in easy numbers such as 1 or 2.
Generally people either use std::hexfloat or cast a pointer to the floating-point value to a pointer to an unsigned integer of the same size and print the indirected value in hex format. Both methods facilitate bit-level analysis of floating-point in a productive fashion.
You could roll your own by casting the address of the float/double to a char* and iterating over its bytes:
#include <memory>
#include <iostream>
#include <limits>
#include <iomanip>
#include <string>

template <typename T>
std::string getBits(T t) {
    std::string returnString{""};
    char *base{reinterpret_cast<char *>(std::addressof(t))};
    char *tail{base + sizeof(t) - 1};
    do {
        for (int bits = std::numeric_limits<unsigned char>::digits - 1; bits >= 0; bits--) {
            returnString += ( ((*tail) & (1 << bits)) ? '1' : '0');
        }
    } while (--tail >= base);
    return returnString;
}

int main() {
    float f{10.0};
    double d{100.0};
    double nd{-100.0};
    std::cout << std::setprecision(1);
    std::cout << getBits(f) << std::endl;
    std::cout << getBits(d) << std::endl;
    std::cout << getBits(nd) << std::endl;
}
Output on my machine (note the sign flip in the third output):
01000001001000000000000000000000
0100000001011001000000000000000000000000000000000000000000000000
1100000001011001000000000000000000000000000000000000000000000000

Each deserialized 64-bit integer number should be converted to the bitwise-equivalent 64-bit floating point number.

I have the above statement in a file I am referring to. The expected output is a double. I could not find anything relevant to my problem.
I found this:
Passing a structure through Sockets in C
but I don't know if it's relevant.
I am not reading that int64 value; I am getting it from another process, and that is the way it is designed.
Does anyone have a theory about serialization and deserialization of ints?
There is exactly one defined way to bitwise-copy one type into another in C++: memcpy.
#include <cstdint>
#include <cstring>     // std::memcpy
#include <iostream>
#include <memory>      // std::addressof
#include <type_traits> // std::enable_if_t

template<class Out, class In, std::enable_if_t<(sizeof(In) == sizeof(Out))>* = nullptr>
Out mangle(const In& in)
{
    Out result;
    std::memcpy(std::addressof(result), std::addressof(in), sizeof(Out));
    return result;
}

int main()
{
    double a = 1.1;
    auto b = mangle<std::uint64_t>(a);
    auto c = mangle<double>(b);
    std::cout << a << " " << std::hex << b << " " << c << std::endl;
}
example output:
1.1 3ff199999999999a 1.1
How about reading that 64-bit number and using reinterpret_cast to convert it to the bitwise-equivalent floating point number? (Beware: dereferencing the cast pointer violates strict aliasing, so this is technically undefined behaviour; the memcpy approach in the other answer is the well-defined way.)

int64_t a = 121314;
double b = *reinterpret_cast<double*>(&a);
int64_t c = *reinterpret_cast<int64_t*>(&b);
assert(a == c);

Convert to hex and delete last two digits

Hello, let's say I have the number 1314173089 in decimal, which is 0x4E54B0A1 in hexadecimal. When I use printf with 0x%X, it converts to hexadecimal correctly. I want to convert my number to hexadecimal and then remove, for example, the last two digits, so it becomes 0x4E54B0 as hex, which is 5133488 in decimal, and I want to store that decimal number in an int for other uses. Could someone give me a hand? So far I can only printf it; I don't know how I would write such a hex function myself.
Simply divide by 0x100:
#include <iostream>

int main()
{
    const unsigned int a = 0x4E54B0A1;
    std::cout << "0x" << std::hex << a / 0x100 << std::endl;
}
This prints 0x4e54b0.
#include <iostream>

unsigned int hexFunction(const unsigned int a) {
    return a / 0x100;
}

int main()
{
    const unsigned int a = 0x4E54B0A1;
    unsigned int hex = hexFunction(a);
    std::cout << "Hex = 0x" << std::hex << hex;
    std::cout << "\tDec = " << std::dec << hex << std::endl;
    return 0;
}

double to string conversion with fixed width

I would like to print a double value, into a string of no more than 8 characters. The printed number should have as many digits as possible, e.g.
5.259675
48920568
8.514e-6
-9.4e-12
I tried C++ iostreams, and printf-style, and neither respects the provided size in the way I would like it to:
cout << setw(8) << 1.0 / 17777.0 << endl;
printf( "%8g\n", 1.0 / 17777.0 );
gives:
5.62525e-005
5.62525e-005
I know I can specify a precision, but I would have to provide a very small precision here, in order to cover the worst case. Any ideas how to enforce an exact field width without sacrificing too much precision? I need this for printing matrices. Do I really have to come up with my own conversion function?
A similar question has been asked 5 years ago: Convert double to String with fixed width , without a satisfying answer. I sure hope there has been some progress in the meantime.
This seems not too difficult, actually, although you can't do it in a single function call. The number of character places used by the exponent is really quite easy to predict:
const char* format;
if (value > 0) {
    if      (value < 10e-100) format = "%.1e";
    else if (value < 10e-10)  format = "%.2e";
    else if (value < 1e-5)    format = "%.3e";
}
and so on.
Only, the C standard, where the behavior of printf is defined, insists on at least two digits for the exponent, so it wastes some there. See c++ how to get "one digit exponent" with printf
Incorporating those fixes is going to make the code fairly complex, although still not as bad as doing the conversion yourself.
If you want to convert to fixed decimal numbers (e.g. drop the +/-"E" part), then it makes it a lot easier to accomplish:
#include <cstring>   // strcpy
#include <iostream>  // std::cout, std::fixed
#include <iomanip>   // std::setprecision
#include <sstream>   // std::ostringstream
#include <string>

char *ToDecimal(double val, int maxChars)
{
    std::ostringstream buffer;
    buffer << std::fixed << std::setprecision(maxChars - 2) << val;
    std::string result = buffer.str();
    size_t i = result.find_last_not_of('\0');
    if (i > static_cast<size_t>(maxChars)) i = maxChars;
    if (result[i] != '.') ++i;
    result.erase(i);
    // Caller owns the returned buffer and must delete[] it.
    char *doubleStr = new char[result.length() + 1];
    strcpy(doubleStr, result.c_str());
    return doubleStr;
}
int main()
{
    std::cout << ToDecimal(1.26743237e+015, 8) << std::endl;
    std::cout << ToDecimal(-1.0, 8) << std::endl;
    std::cout << ToDecimal(3.40282347e+38, 8) << std::endl;
    std::cout << ToDecimal(1.17549435e-38, 8) << std::endl;
    std::cout << ToDecimal(-1E4, 8) << std::endl;
    std::cout << ToDecimal(12.78e-2, 8) << std::endl;
}
Output:
12674323
-1
34028234
0.000000
-10000
0.127800

Converting remainder of a floating point float/double into an int

I have a double (or float) number x:
x = 1234.5678;
Now, the question is: how do I break the number into two ints, where int1 gets the digits before the point and int2 the digits after it?
The first part is easy; we can cast, or use floor or round, to get the integral part into an int. I am looking for the second part to become int2 = 5678, without any decimal point.
i.e. to to extend the above example:
float x = 1234.5678;
int x1 = (int) x; // which would return 1234 great.
int x2 = SomeFunction????(x); // where I need x2 to become = 5678
Notice the 5678 should not have any points there.
It would be nice to hear from you.
Thanks
Heider
Here are two ways of doing it.
The first one uses std::stringstream, std::string and std::strtol and is sorta hacky. It is also not very efficient, but it does the job.
The second one needs to know the number of decimals and uses simple multiplication. NOTE: This method will not do any kind of checking on whether the float you pass in actually has that number of decimals.
None of these methods are particularly elegant, but they worked well for the numbers I tested ( both positive and negative. ) Feel free to comment if you find bugs/errors or if you have suggestions for improvement.
EDIT: As #dan04 pointed out, this method will return the same value for 0.4 as for 0.04. If you want to distinguish these, you'd need a second int for storing the number of zeros after the decimal point.
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>   // std::strtol
#include <math.h>

int GetDecimalsUsingString( float number );
int GetDecimals( float number, int num_decimals );

int main() {
    float x = 1234.5678;
    int x1 = (int) x; // which would return 1234, great.
    float remainder = x - static_cast< float > ( x1 );
    std::cout << "Original : " << x << std::endl;
    std::cout << "Before comma : " << x1 << std::endl;
    std::cout << "Remainder : " << remainder << std::endl;
    // "Ugly" way using std::stringstream and std::string
    int res_string = GetDecimalsUsingString( remainder );
    // Nicer, but requires that you specify the number of decimals
    int res_num_decimals = GetDecimals( remainder, 5 );
    std::cout << "Result using string : " << res_string << std::endl;
    std::cout << "Result using known number of decimals : " << res_num_decimals << std::endl;
    return 0;
}

int GetDecimalsUsingString( float number )
{
    // Put the number in a stringstream
    std::stringstream ss;
    ss << number;
    // Put the content of the stringstream into a string
    std::string str = ss.str();
    // Remove the first part of the string (minus symbol, 0 and decimal point)
    if ( number < 0.0 )
        str = str.substr( 3, str.length() - 1 );
    else
        str = str.substr( 2, str.length() - 1 );
    // Convert the string back to an int
    int ret = std::strtol( str.c_str(), NULL, 10 );
    // Preserve the sign
    if ( number < 0 )
        ret *= -1;
    return ret;
}

int GetDecimals( float number, int num_decimals )
{
    int decimal_multiplier = pow( 10, num_decimals );
    int result = number * decimal_multiplier;
    return result;
}
Output :
Original : 1234.57
Before comma : 1234
Remainder : 0.567749
Result using string : 567749
Result using known number of decimals : 56774
I guess there are no built-in C/C++ functions to do this, other than these two methods:
1) Converting to a string as above and then scanning the parts back into two ints.
2) Accessing the raw memory contents of the variable and decoding it manually.