Rounding error of binary32


As part of a homework assignment, I'm writing a program that takes a decimal number as input from the terminal, prints the IEEE 754 binary32 representation of that number, and returns 1 if the binary form represents the number exactly, 0 otherwise. We are only allowed to use iostream and cmath.
I have already written the part that prints the binary32 format, but I don't understand how to tell whether the conversion to that format involved rounding.
My idea was to compute the decimal number back from the binary32 form and compare it with the original number. But I am having difficulty storing the binary32 digits in some kind of data structure, since I can't use the vector header. I've tried using for loops and pow, but I still get the indices wrong.
Also, I'm having trouble understanding what exactly numf or *numf is. I wrote the code myself, but I only know that I needed to convert a pointer to float into a pointer to char.
My other idea was to compare binary32 with binary64, which has more precision. Again, I don't know how to do this without using vector.
int main(int argc, char* argv[]){
int i ,j;
float num;
num = atof(argv[1]);
char* numf = (char*)(&num);
for (i = sizeof(float) - 1; i >= 0; i--){
for (j = 7; j >= 0; j--)
if (numf[i] & (1 << j)) {
cout << "1";
}else{
cout << "0";
}
}
cout << endl;
}
//////
Update:
Since there's no way around it without using header files, I hard-coded for loops to convert binary32 back to decimal.
Since x = 1.b22b21...b0 * 2^p (binary32 stores 23 fraction bits), I used one for loop to find the exponent and one for loop to find the significand.

Basic idea: Convert your number d back to a string (e.g. with to_string) and compare it to the input. If the strings are different, there was some loss because of the limitations of float.
Of course, this means your input always has to be in the same string format that to_string uses: no additional unneeded 0's, no whitespace, etc.
...
That said, doing the float conversion without a cast (by manually parsing the input and calculating the IEEE754 bits) is more work initially, but in return it solves this problem automatically. And, as noted in the comments, your cast might not work the way you want.
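The asker's second idea (comparing binary32 against binary64) can be sketched as follows. This is a heuristic, not an exact test: it assumes the double holds the input exactly enough to serve as a reference, so an input with more significant digits than a double can hold may be reported as exact when it is not. The function name is made up for illustration, and std::strtod comes from <cstdlib>, which the homework rules may not permit.

```cpp
#include <cstdlib>

// Hypothetical helper (not from the original post).
// Returns true when narrowing the parsed value to float and widening it
// back to double gives the same double, i.e. binary32 lost nothing
// relative to binary64.
// Assumption: the double is an adequate stand-in for the typed decimal.
bool survives_float_round_trip(const char* text) {
    const double reference = std::strtod(text, nullptr);
    const float narrowed = static_cast<float>(reference);
    return static_cast<double>(narrowed) == reference;
}
```

With this sketch, "0.5" is reported as exact, while "0.1" and "16777217" (2^24 + 1, which needs 25 significand bits) are not.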

Related

float number to string converting implementation in STD

I ran into a curious issue. Look at this simple code:
int main(int argc, char **argv) {
char buf[1000];
snprintf_l(buf, sizeof(buf), _LIBCPP_GET_C_LOCALE, "%.17f", 0.123e30f);
std::cout << "WTF?: " << buf << std::endl;
}
The output looks quite weird:
123000004117574256822262431744.00000000000000000
My question is: how is this implemented? Can someone show me the original code? I could not find it. Or maybe it's too complicated for me.
I tried to reimplement the same float-to-string conversion in Java but failed. Even when I extracted the exponent and fraction parts separately and summed the fraction bits in a loop, I always got zeros instead of the digits "...822262431744". When I tried to keep summing fraction bits beyond the 23 bits of a float, I ran into another issue: how many bits do I need to collect? Why does the original code stop at the part left of the decimal point instead of continuing until the scale ends?
So I really do not understand the basic logic of how it is implemented. I also tried really big numbers (e.g. 0.123e127f), and it generates a huge number in decimal format. The number has much higher precision than float can be. This looks like a bug, because the string representation contains something that the float number cannot.

Please read documentation:
printf, fprintf, sprintf, snprintf, printf_s, fprintf_s, sprintf_s, snprintf_s - cppreference.com
The format string consists of ordinary multibyte characters (except %), which are copied unchanged into the output stream, and conversion specifications. Each conversion specification has the following format:
introductory % character
...
(optional) . followed by integer number or *, or neither that specifies precision of the conversion. In the case when * is used, the precision is specified by an additional argument of type int, which appears before the argument to be converted, but after the argument supplying minimum field width if one is supplied. If the value of this argument is negative, it is ignored. If neither a number nor * is used, the precision is taken as zero. See the table below for exact effects of precision.
....
Conversion Specifier
Explanation
Expected Argument Type
f F
converts floating-point number to the decimal notation in the style [-]ddd.ddd. Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.
double
So with f you forced the form ddd.ddd (no exponent), and with .17 you forced 17 digits to be shown after the decimal separator. With such a big value, the printed result looks odd.
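To make the effect of the conversion specifier concrete, here is a minimal sketch that formats the same float two ways with std::snprintf. The helper names fixed17 and general9 are made up for illustration. The digits that "%.17f" produces for very large values depend on the quality of the C library in use.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helpers for illustration only.

// "%.17f": fixed notation, exactly 17 digits after the decimal point,
// however many digits are needed before it.
std::string fixed17(float x) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "%.17f", static_cast<double>(x));
    return buf;
}

// "%.9g": at most 9 significant digits, switching to exponent notation
// when the value is large or small.
std::string general9(float x) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "%.9g", static_cast<double>(x));
    return buf;
}
```

For 0.123e30f, general9 gives "1.23000004e+29", which is the same stored value printed compactly instead of as a 30-digit integer part followed by 17 zeros.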

Finally I found out what the difference is between the Java float -> decimal -> string conversion and the C++ float -> string (decimal) conversion. I did not find the original source code, but I replicated the same logic in Java to make it clear. I think the code explains everything:
// the context size might be calculated properly by getting maximum
// float number (including exponent value) - its 40 + scale, 17 for me
MathContext context = new MathContext(57, RoundingMode.HALF_UP);
BigDecimal divisor = BigDecimal.valueOf(2);
int tmp = Float.floatToRawIntBits(1.23e30f);
boolean sign = tmp < 0;
tmp <<= 1;
// there might be NaN value, this code does not support it
int exponent = (tmp >>> 24) - 127;
tmp <<= 8;
int mask = 1 << 23;
int fraction = mask | (tmp >>> 9);
// at this line we have all parts of the float: sign, exponent and fractions. Let's build mantissa
BigDecimal mantissa = BigDecimal.ZERO;
for (int i = 0; i < 24; i ++) {
if ((fraction & mask) == mask) {
// i'm not sure about speed, maybe division at each iteration might be faster than pow
mantissa = mantissa.add(divisor.pow(-i, context));
}
mask >>>= 1;
}
// it was the core line where I was losing accuracy, because of context
BigDecimal decimal = mantissa.multiply(divisor.pow(exponent, context), context);
String str = decimal.setScale(17, RoundingMode.HALF_UP).toPlainString();
// add minus manually, because java lost it if after the scale value become 0, C++ version of code doesn't do it
if (sign) {
str = "-" + str;
}
return str;
Maybe this topic is useless. Who really needs the same implementation that C++ has? But at least this code keeps the full precision of the float, compared to the most popular way of converting a float to a decimal string:
return BigDecimal.valueOf(1.23e30f).setScale(17, RoundingMode.HALF_UP).toPlainString();

The C++ implementation you are using uses the IEEE-754 binary32 format for float. In this format, the closest representable value to 0.123•10^30 is 123,000,004,117,574,256,822,262,431,744, which is represented in the binary32 format as +13,023,132•2^73. So 0.123e30f in the source code yields the number 123,000,004,117,574,256,822,262,431,744. (Because the number is represented as +13,023,132•2^73, we know its value is that exactly, which is 123,000,004,117,574,256,822,262,431,744, even though the digits “123000004117574256822262431744” are not stored directly.)
Then, when you format it with %.17f, your C++ implementation prints the exact value faithfully, yielding “123000004117574256822262431744.00000000000000000”. This accuracy is not required by the C++ standard, and some C++ implementations will not do the conversion exactly.
The Java specification also does not require formatting of floating-point values to be exact, at least in some formatting operations. (I am going from memory and some supposition here; I do not have a citation at hand.) It allows, perhaps even requires, that only a certain number of correct digits be produced, after which zeros are used if needed for positioning relative to the decimal point or for the requested format.
“The number has much higher precision than float can be.”
For any value represented in the float format, that value has infinite precision. The number +13,023,132•2^73 is exactly +13,023,132•2^73, which is exactly 123,000,004,117,574,256,822,262,431,744, to infinite precision. The precision the format has for representing numbers affects only which numbers it can represent, not how precisely it represents the numbers that it does represent.
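The "integer significand times a power of two" form used above can be recovered with std::frexp and std::ldexp from <cmath>. This is a minimal sketch assuming float is IEEE-754 binary32 with a 24-bit significand; the struct and function names are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical names, for illustration only.
struct Binary32Parts {
    std::int32_t significand;  // integer of at most 24 bits, carries the sign
    int exponent;              // power of two
};

// Splits a finite float x so that x == significand * 2^exponent.
// Assumption: float is IEEE-754 binary32 (24-bit significand).
Binary32Parts decompose(float x) {
    int e = 0;
    const float m = std::frexp(x, &e);       // x == m * 2^e, 0.5 <= |m| < 1
    const float scaled = std::ldexp(m, 24);  // integer-valued, |scaled| < 2^24
    return { static_cast<std::int32_t>(scaled), e - 24 };
}
```

For 0.123e30f this yields significand 13023132 and exponent 73, matching the value quoted in the answer.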

C++ Modulus returning wrong answer

Here is my code :
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
int n, i, num, m, k = 0;
cout << "Enter a number :\n";
cin >> num;
n = log10(num);
while (n > 0) {
i = pow(10, n);
m = num / i;
k = k + pow(m, 3);
num = num % i;
--n;
cout << m << endl;
cout << num << endl;
}
k = k + pow(num, 3);
return 0;
}
When I input 111 it gives me this
1
12
1
2
I am using codeblocks. I don't know what is wrong.

Whenever I use pow expecting an integer result, I add .5 so I use (int)(pow(10,m)+.5) instead of letting the compiler automatically convert pow(10,m) to an int.
I have read many places telling me others have done exhaustive tests of some of the situations in which I add that .5 and found zero cases where it makes a difference. But accurately identifying the conditions in which it isn't needed can be quite hard. Using it when it isn't needed does no real harm.
If it makes a difference, it is a difference you want. If it doesn't make a difference, it had a tiny cost.
In the posted code, I would adjust every call to pow that way, not just the one I used as an example.
There is no equally easy fix for your use of log10, but it may be subject to the same problem. Since you expect a non-integer answer and want that non-integer answer truncated down to an integer, adding .5 would be very wrong. So you may need to find some more complicated workaround for the fundamental problem of working with floating point. I'm not certain, but assuming 32-bit integers, I think adding 1e-10 to the result of log10 before converting to int is both never enough to change log10(10^n-1) into log10(10^n) but always enough to correct the error that might have done the reverse.
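One workaround that sidesteps log10 entirely is to count digits with integer division. This is a swapped-in technique rather than what the answer above proposes, and the name count_digits is made up for illustration.

```cpp
// Hypothetical helper: number of decimal digits in a non-negative int.
// Uses only integer division, so no floating-point rounding is involved.
int count_digits(int n) {
    int digits = 1;
    while (n >= 10) {
        n /= 10;
        ++digits;
    }
    return digits;
}
```

The question's n = log10(num) would then correspond to count_digits(num) - 1.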

pow does floating-point exponentiation.
Floating point functions and operations are inexact. You cannot rely on them to give you the exact value that they would appear to compute, unless you are an expert on the fine details of IEEE floating point representations and the guarantees given by your library functions.
(Furthermore, floating-point numbers might even be incapable of representing the integers you want exactly.)
This is particularly problematic when you convert the result to an integer, because the result is truncated toward zero: int x = 0.999999; sets x == 0, not x == 1. Even the tiniest error in the wrong direction completely spoils the result.
You could round to the nearest integer, but that has problems too; e.g. with sufficiently large numbers, your floating point numbers might not have enough precision to be near the result you want. Or if you do enough operations (or unstable operations) with the floating point numbers, the errors can accumulate to the point you get the wrong nearest integer.
If you want to do exact, integer arithmetic, then you should use functions that do so. e.g. write your own ipow function that computes integer exponentiation without any floating-point operations at all.
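A minimal sketch of such an ipow, using exponentiation by squaring. It assumes the result fits in a long long and performs no overflow checks; the signature is an illustrative choice, not something the answer specifies.

```cpp
// Sketch of integer exponentiation by squaring: no floating point at all.
// Assumption: the caller ensures the result fits in long long.
long long ipow(long long base, unsigned exponent) {
    long long result = 1;
    while (exponent > 0) {
        if (exponent & 1u) {
            result *= base;      // this bit of the exponent is set
        }
        exponent >>= 1;
        if (exponent > 0) {
            base *= base;        // square only while more bits remain
        }
    }
    return result;
}
```

In the question's code, i = pow(10, n) would become i = ipow(10, n), and pow(m, 3) would become ipow(m, 3).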

Weird Rounding Occurs in C++ Function

I am writing a function in c++ that is supposed to find the largest single digit in the number passed (inputValue). For example, the answer for .345 is 5. However, after a while, the program is changing the inputValue to something along the lines of .3449 (and the largest digit is then set to 9). I have no idea why this is happening. Any help to resolve this problem would be greatly appreciated.
This is the function in my .hpp file
void LargeInput(const double inputValue)
//Function to find the largest value of the input
{
int tempMax = 0,//Value that the temporary max number is in loop
digit = 0,//Value of numbers after the decimal place
test = 0,
powerOten = 10;//Number multiplied by so that the next digit can be checked
double number = inputValue;//A variable that can be changed in the function
cout << "The number is still " << number << endl;
for (int k = 1; k <= 6; k++)
{
test = (number*powerOten);
cout << "test: " << test << endl;
digit = test % 10;
cout << (static_cast<int>(number*powerOten)) << endl;
if (tempMax < digit)
tempMax = digit;
powerOten *= 10;
}
return;
}

Most decimal fractions cannot be represented exactly as doubles in a computer; they have to be approximated. If you change your function to work on longs or ints there won't be any inaccuracies. That seems natural enough for the context of your question: you're just looking at the digits and not the number, so .345 can be 345 and give the same result.
Try this:
int get_largest_digit(int n) {
int largest = 0;
while (n > 0) {
int x = n % 10;
if (x > largest) largest = x;
n /= 10;
}
return largest;
}

This is because the fractional part of a floating-point number is stored as a sum of terms of the form 1/2^n. As a result you can get values very close to what you want, but you can never reach values like 1/3 exactly.
It's common to instead use integers and have a conversion (like 1000 = 1), so if you had the number 1333 you would do printf("%d.%03d", 1333/1000, 1333 % 1000) to print out 1.333. (The %03d keeps leading zeros, so 1005 prints as 1.005.)
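A minimal sketch of the scaled-integer idea, assuming non-negative values stored in thousandths. The helper name is made up for illustration. The fractional part is zero-padded to three digits so that 1005 comes out as 1.005 rather than 1.5.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a non-negative value stored in thousandths
// (1000 means 1.000). "%03d" keeps leading zeros in the fractional part.
std::string format_thousandths(int value) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%d.%03d", value / 1000, value % 1000);
    return buf;
}
```

Because every step is integer arithmetic, the digits printed are exactly the digits stored.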
By the way, the first sentence is a simplification of how floating point numbers are actually represented. For more information check out: http://en.wikipedia.org/wiki/Floating_point#Representable_numbers.2C_conversion_and_rounding

This is how floating point numbers work, unfortunately. The core of the problem is that there are infinitely many real numbers. More specifically, there are infinitely many values between 0.1 and 0.2, and infinitely many values between 0.01 and 0.02. Computers, however, have a finite number of bits to represent a floating point number (64 bits for a double precision number). Therefore, most real numbers have to be approximated. After any floating point operation, the processor has to round the result to a value it can represent in 64 bits.
Another property of floating point numbers is that the gap between neighbouring representable values grows as the numbers get bigger. This is because the same 64 bits have to be able to represent very big numbers (1,000,000,000) and very small numbers (0.000,000,000,001). Therefore, the absolute rounding error gets larger when working with bigger numbers.
The other issue here is that you are converting from floating point to integer. The conversion truncates rather than rounds, which magnifies the error. It appears that (0.345 * 10000) comes out slightly below 3450, so converting it to an integer gives 3449.
I suggest you don't convert your numbers to integers. Write your program in terms of floating point numbers. You can't use the modulus (%) operator on floating point numbers to get a value for digit. Instead use the fmod function from the C math library (<cmath>, or math.h in C).

As other answers have indicated, binary floating-point is incapable of representing most decimal numbers exactly. Therefore, you must reconsider your problem statement. Some alternatives are:
The number is passed as a double (specifically, a 64-bit IEEE-754 binary floating-point value), and you wish to find the largest digit in the decimal representation of the exact value passed. In this case, the solution suggested by user millimoose will work (provided the asprintf or snprintf function used is of good quality, so that it does not incur rounding errors that prevent it from producing correctly rounded output).
The number is passed as a double but is intended to represent a number that is exactly representable as a decimal numeral with a known number of digits. In this case, the solution suggested by user millimoose again works, with the format specification altered to convert the double to decimal with the desired number of digits (e.g., instead of “%.64f”, you could use “%.6f”).
The function is changed to pass the number in another way, such as with decimal floating-point, as a scaled integer, or as a string containing a decimal numeral.
Once you have clarified the problem statement, it may be interesting to consider how to solve it with floating-point arithmetic, rather than calling library functions for formatted output. This is likely to have pedagogical value (and incidentally might produce a solution that is computationally more efficient than calling a library function).
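The formatting approach from the second alternative can be sketched as follows, assuming the library's snprintf produces correctly rounded output. The function name and the buffer size are illustrative choices, not part of any answer here.

```cpp
#include <cstdio>

// Hypothetical helper: largest decimal digit of `value` when it is
// formatted with `places` digits after the decimal point ("%.*f").
// Assumption: a finite value whose formatted text fits in the buffer.
int largest_digit(double value, int places) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.*f", places, value);
    int largest = 0;
    for (const char* p = buf; *p != '\0'; ++p) {
        if (*p >= '0' && *p <= '9' && *p - '0' > largest) {
            largest = *p - '0';
        }
    }
    return largest;
}
```

With six places, 0.345 is formatted as 0.345000 and the result is 5. Asking for many more places would expose the digits of the stored binary value instead, which is the first alternative described above.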

Why do simple doubles like 1.82 end up being 1.819999999645634565360? [duplicate]

Possible Duplicate:
Why does Visual Studio 2008 tell me .9 - .8999999999999995 = 0.00000000000000055511151231257827?
Hey, so I'm making a function to return the number of digits in a value of a given numeric type, but I'm having some trouble with doubles.
I figure out how many digits are in it by multiplying it by something like 10 billion and then taking away digits one by one until the double ends up being 0. However, when I pass in a double such as .7904, the function never exits: it keeps taking away digits that never end up being 0, because the result for .7904 ends up being 7,903,999,988 and not 7,904,000,000.
How can I solve this problem? Thanks =) ! Oh, and any other feedback on my code is WELCOME!
here's the code of my function:
/////////////////////// Numb_Digits() ////////////////////////////////////////////////////
enum{DECIMALS = 10, WHOLE_NUMBS = 20, ALL = 30};
template<typename T>
unsigned long int Numb_Digits(T numb, int scope)
{
unsigned long int length= 0;
switch(scope){
case DECIMALS: numb-= (int)numb; numb*=10000000000; // 10 bil (10 zeros)
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
case WHOLE_NUMBS: numb= (int)numb; numb*=10000000000;
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
case ALL: numb = numb; numb*=10000000000;
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
default: break;}
return length;
};
int main()
{
double test = 345.6457;
cout << Numb_Digits(test, ALL) << endl;
cout << Numb_Digits(test, DECIMALS) << endl;
cout << Numb_Digits(test, WHOLE_NUMBS) << endl;
return 0;
}

It's because of their binary representation, which is discussed in depth here:
http://en.wikipedia.org/wiki/IEEE_754-2008
Basically, when a number can't be represented as is, an approximation is used instead.
To compare floats for equality, check whether their difference is smaller than a chosen tolerance.
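A minimal sketch of such a comparison. The name nearly_equal and the default tolerance of 1e-9 are arbitrary choices for illustration; a suitable tolerance depends on the computation that produced the values.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most `tolerance`
// scaled by the larger magnitude (a relative test), with the same
// tolerance acting as an absolute floor for values near zero.
bool nearly_equal(double a, double b, double tolerance = 1e-9) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(1.0, std::max(std::fabs(a), std::fabs(b)));
    return diff <= tolerance * scale;
}
```

For example, nearly_equal(0.1 + 0.2, 0.3) holds even though the two doubles are not bit-for-bit equal.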

The easy summary about floating point arithmetic :
http://floating-point-gui.de/
Read this and you'll see the light.
If you're more on the math side, Goldberg paper is always nice :
http://cr.yp.to/2005-590/goldberg.pdf
Long story short : real numbers are stored with a fixed, irregular precision, leading to non obvious behaviors. This is unrelated to the language but more a design choice of how to handle real numbers as a whole.

This is because C++ (like most other languages) cannot store floating point numbers with infinite precision.
Floating point numbers are stored like this:
sign * coefficient * 10^exponent if you're using base 10 (binary formats such as double use base 2 instead).
The problem is that both the coefficient and exponent are stored as finite integers.
This is a common problem with storing floating point in computer programs; you usually get a tiny rounding error.
The most common ways of dealing with this are:
Store the number as a fraction (x/y)
Use a delta that allows small deviations (if abs(x-y) < delta)
Use a third party library such as GMP that can store numbers with arbitrary precision, or exactly as rationals.
Regarding your question about counting decimals.
There is no way of dealing with this if you get a double as input. You cannot be sure that the user actually sent 1.819999999645634565360 and not 1.82.
Either you have to change your input or change the way your function works.
More info on floating point can be found here: http://en.wikipedia.org/wiki/Floating_point

This is because of the way IEEE floating point works: each result is rounded to the nearest representable value, and the rounding can differ depending on the operations performed. The stored value is an approximation. Never use logic of if(float == float), ever!

Float numbers are represented in the form significant digits × base^exponent (IEEE 754). In your case, float 1.82 = 1 + 0.5 + 0.25 + 0.0625 + ...
Since only a limited number of digits can be stored, there will be a rounding error if the number cannot be represented as a terminating expansion in the relevant base (base 2 in this case).

You should always check relative differences with floating point numbers, not absolute values.
You need to read this, too.

Computers don't store floating point numbers exactly. To accomplish what you are doing, you could store the original input as a string, and count the number of characters.
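A minimal sketch of the string-based approach, counting the characters after the decimal point in the text as typed. The name count_decimals is made up for illustration, and no validation of the input is attempted.

```cpp
#include <string>

// Hypothetical helper: number of characters after the first '.' in the
// text the user typed; 0 when there is no '.'. No validation is done.
std::string::size_type count_decimals(const std::string& text) {
    const std::string::size_type dot = text.find('.');
    if (dot == std::string::npos) {
        return 0;
    }
    return text.size() - dot - 1;
}
```

Because the text never passes through a double, "1.82" yields 2 regardless of how 1.82 would be stored in binary.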

How to convert string (22.123) format number into float variable format without using any API in c++

How can I convert a number in string format (22.123) into a float variable without using any API in C++? This is just to understand more about how the conversion works internally. Thanks.

something like:
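#include <string>

// Handles only digits and at most one '.'; no sign, exponent or validation.
double string_to_double(const std::string& s)
{
    double val = 0;        // all digits accumulated as one whole number
    double scale = 1;      // 10^(number of digits after the '.')
    bool seen_dot = false;
    for (std::string::size_type i = 0; i < s.length(); ++i)
    {
        if (s[i] == '.') { seen_dot = true; continue; }
        val = val * 10 + (s[i] - '0');
        if (seen_dot) { scale *= 10; }
    }
    return val / scale;    // shift the decimal point back into place
}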

Basic algorithm, assuming no input in the form 1.2e-4:
(1) Read an integer before the dot. If the number of digits is > 16 (normal precision of double), convert that integer into floating point directly and return.
(2) Read at most 16 digits after the dot as an integer. Compute (that integer) ÷ 10^(digits read). Add this to the integer from step (1) and return.
This only involves 2 floating point operations: one + and one ÷, and a bunch of integer arithmetic. The advantage over multiplications and divisions by powers of 10 is that the error won't accumulate unnecessarily.
(To read 16-digit integers you need a 64-bit int.)
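The two steps above can be sketched as follows. This assumes the input contains only digits and at most one dot, with no sign, no exponent, and at most 16 digits on each side of the dot; the function name is made up for illustration and the early return from step (1) is left out.

```cpp
#include <cstdint>
#include <string>

// Sketch of the two-step algorithm under the assumptions stated above.
double parse_simple_decimal(const std::string& s) {
    std::uint64_t whole = 0;     // digits before the dot
    std::uint64_t fraction = 0;  // digits after the dot, read as an integer
    std::uint64_t divisor = 1;   // 10^(number of fraction digits)
    bool after_dot = false;
    for (char c : s) {
        if (c == '.') {
            after_dot = true;
        } else if (!after_dot) {
            whole = whole * 10 + static_cast<std::uint64_t>(c - '0');
        } else {
            fraction = fraction * 10 + static_cast<std::uint64_t>(c - '0');
            divisor *= 10;
        }
    }
    // Exactly one division and one addition happen in floating point.
    return static_cast<double>(whole)
         + static_cast<double>(fraction) / static_cast<double>(divisor);
}
```

All digit accumulation happens in 64-bit integers, so rounding can only occur in the final division and addition.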
In reality, you should use sscanf(str, "%lf", ...), std::istringstream, or boost::lexical_cast<double>.

Go over the number digit by digit, using a series of multiplications and divisions by powers of 10, and construct the string character by character.

If you just want an idea of how to do it, see the other answer. If you want an accurate result, the problem is not so simple and you should refer to the literature on the subject. An example: ftp://ftp.ccs.neu.edu/pub/people/will/howtoread.ps

I'm pretty sure that the Plauger Standard C Library book has a disc with the source of strtod.
http://www.amazon.co.uk/Standard-C-Library-P-J-Plauger/dp/0131315099
and there are online versions too:
http://www.google.co.uk/search?hl=en&client=firefox-a&hs=IvI&rls=org.mozilla%3Aen-GB%3Aofficial&q=strtod+source+code
