Choosing between float and double

Choosing between float and double - c++

The background:
I have been working on the following problem, "The Trip" from "Programming Challenges: The Programming Contest Training Manual" by S. Skiena:
A group of students are members of a club that travels annually to
different locations. Their destinations in the past have included
Indianapolis, Phoenix, Nashville, Philadelphia, San Jose, and Atlanta.
This spring they are planning a trip to Eindhoven.
The group agrees in advance to share expenses equally, but it is not
practical to share every expense as it occurs. Thus individuals in the
group pay for particular things, such as meals, hotels, taxi rides,
and plane tickets. After the trip, each student's expenses are tallied
and money is exchanged so that the net cost to each is the same, to
within one cent. In the past, this money exchange has been tedious and
time consuming. Your job is to compute, from a list of expenses, the
minimum amount of money that must change hands in order to equalize
(within one cent) all the students' costs.
Input
Standard input will contain the information for several trips. Each
trip consists of a line containing a positive integer n denoting the
number of students on the trip. This is followed by n lines of input,
each containing the amount spent by a student in dollars and cents.
There are no more than 1000 students and no student spent more than
$10,000.00. A single line containing 0 follows the information for the
last trip.
Output
For each trip, output a line stating the total amount of money, in
dollars and cents, that must be exchanged to equalize the students'
costs.
(Bold is mine, book here, site here)
I solved the problem with the following code:
/*
* the-trip.cpp
*/
#include <iostream>
#include <iomanip>
#include <cmath>
int main( int argc, char * argv[] )
{
int students_number, transaction_cents;
double expenses[1000], total, average, given_change, taken_change, minimum_change;
while (std::cin >> students_number) {
if (students_number == 0) {
return 0;
}
total = 0;
for (int i=0; i<students_number; i++) {
std::cin >> expenses[i];
total += expenses[i];
}
average = total / students_number;
given_change = 0;
taken_change = 0;
for (int i=0; i<students_number; i++) {
if (average > expenses[i]) {
given_change += std::floor((average - expenses[i]) * 100) / 100;
}
if (average < expenses[i]) {
taken_change += std::floor((expenses[i] - average) * 100) / 100;
}
}
minimum_change = given_change > taken_change ? given_change : taken_change;
std::cout << "$" << std::setprecision(2) << std::fixed << minimum_change << std::endl;
}
return 0;
}
My original implementation had float instead of double. It was working with the small problem instances provided with the description and I spent a lot of time trying to figure out what was wrong.
In the end I figured out that I had to use double precision, apparently some big input in the programming challenge tests made my algorithms with float fail.
The question:
Given the input can have 1000 students and each student can spend up to 10,000$, my total variable has to store a number of the maximum size of
10,000,000.
How should I decide which precision is needed?
Is there something that should have given me an hint that float wasn't enough for this task?
I later realized that in this case I could have avoided floating point at all since my number fits into integer types, but I'm still interested in understanding if there was a way to foresee that float wasn't precise enough in this case.

Is there something that should have given me an hint that float wasn't enough for this task?
The fact that 0.10 is not representable at all in binary floating-point (which both float and double are if you use an ordinary computer) should have been the hint. Binary floating-point is perfect for physical quantities that arrive inaccurate to begin with, or for computations that will be inaccurate anyway whatever the reasonable numerical system with decidable equality. Exact computations of monetary amounts are not a good application of binary floating-point.
How should I decide which precision is needed? … my total variable has to store a number of the maximum size of 10,000,000.
Use an integer type to represent numbers of cents. By your own reasoning, you shouldn't have to deal with amounts of more than 1,000,000,000 cents, so long should be enough, but just use long long and save yourself the risk of trouble with corner cases.

As you said: Never use floating point variables to represent money. Using an integer representation - either as one large number in form of cents or whatever the fraction of the local currency is, or as two numbers [which makes the math a bit more awkward, but easier to see/read/write the value as two units].
The motivation for not using floating point is that it's "often not accurate". Just like 1/3 can't be writen as an exact value using decimal representation, no matter how many threes you write, the actual answer would have more threes, binary floating point values can not precisely describe some decimal values, and you get "Your value of 0.20 does not match 0.20 that the customer owes" - which doesn't make sense, but that's because "0.200000000001" and "0.19999999999" aren't exactly the same thing according to the computer. And eventually, those little rounding errors will cause some big problem in one way or another - and this regardless of if it's float, double or extra_super_long_double.
However, if you have a question like this: if I have to represent a value of 10 million, with a precison of 1/100th of the unit, how big a floating point variable do I need, your calculation becomes:
float bigNumber = 10000000;
float smallNumber = 0.01;
float bits = log2(bigNumber/smallNumber);
cout << "Bits in mantissa needed: " << ceil(bits) << endl;
So, in this case, we get bits as 29.897, so you need 30 bits (in other words, float is not good enough.
Of course, if you do not need fractions of a dollar (or whatever), you can get away with a few less digits. Namely log2(10000000) = 23.2 - so 24 bits of mantissa -> still too big for a float.

10,000,000>2^23 so you need at least 24 bits of mantissa, which is what single precision provides. Because of intermediate rounding, the last bits can err.
1 digit ~ 3.321928 bits.

Related

Loss of precision with pow function when surpassing 10^10 limit?

Doing one of my first homeworks of uni, and have ran into this problem:
Task: Find a sum of all n elements where n is the count of numerals in a number (n=1, means 1, 2, 3... 8, 9 for example, answer is 45)
Problem: The code I wrote has gotten all the test answers correctly up to 10 to the power of 9, but when it reaches 10 to the power of 10 territory, then the answers start being wrong, it's really close to what I should be getting, but not quite there (For example, my output = 49499999995499995136, expected result = 49499999995500000000)
Would really appreciate some help/insights, am guessing it's something to do with the variable types, but not quite sure of a possible solution..
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
int main()
{
int n;
double ats = 0, maxi, mini;
cin >> n;
maxi = pow(10, n) - 1;
mini = pow(10, n-1) - 1;
ats = (maxi * (maxi + 1)) / 2 - (mini * (mini + 1)) / 2;
cout << setprecision(0) << fixed << ats;
}

The main reason of problems is pow() function. It works with double, not int. Loss of accuracy is price for representing huge numbers.
There are 3 way's to solve problem:
For small n you can make your own long long int pow(int x, int pow) function. But there is problem, that we can overflow even long long int
Use long arithmetic functions, as #rustyx sayed. You can write your own with vector, or find and include library.
There is Math solution specific for topic's task. It solves the big numbers problem.
You can write your formula like
((10^n) - 1) * (10^n) - (10^m - 1) * (10^m)) / 2 , (here m = n-1)
Then multiply numbers in numerator. Regroup them. Extract common multiples 10^(n-1). And then you can see, that answer have a structure:
X9...9Y0...0 for big enought n, where letter X and Y are constants.
So, you can just print the answer "string" without calculating.

I think you're stretching floating points beyond their precision. Let me explain:
The C pow() function takes doubles as arguments. You're passing ints, the compiler is adding the code to convert them to doubles before they reach pow(). (And anyway you're storing it as a double when you get the return value since you declared it that way).
Floating points are called that way precisely because the point "floats". Inside a double there's a sign bit, a few bits for the mantissa and a few bits for the exponent. In binary, elevating to a power of two is equivalent to moving the fractional point to the right (or to the left if you're elevating to a negative number). So basically the exponent is saying where the fractional point is, in binary. The great advantage of using this kind of in-memory representation for doubles is that you get a lot of precision for numbers close to 0, and gradually lose precision as numbers become bigger.
That last thing is exactly what's happening to you. Your number is too large to be stored exactly. So it's being rounded to the closest sum of powers of two (powers of two are the numbers that have all zeroes to the right in binary).
Quick experiment: press F12 in your browser, open the javascript console and type 49499999995499995136. In my case, in chrome, I reproduce the same problem.
If you really really really want precision with such big numbers then you can try some of these libraries, but that's too advanced for a student program, you don't need it. Just add an if block and print an error message if the number that the user typed is too big (professors love that, which is actually quite correct).

Errors in Casting Doubles to Integers [duplicate]

This question already has answers here:
Round a float to a regular grid of predefined points
(11 answers)
Closed 4 years ago.
I am calculating the number of significant numbers past the decimal point. My program discards any numbers that are spaced more than 7 orders of magnitude apart after the decimal point. Expecting some error with doubles, I accounted for very small numbers popping up when subtracting ints from doubles, even when it looked like it should equal zero (To my knowledge this is due to how computers store and compute their numbers). My confusion is why my program does not handle this unexpected number given this random test value.
Having put many cout statements it would seem that it messes up when it tries to cast the final 2. Whenever it casts it casts to 1 instead.
bool flag = true;
long double test = 2029.00012;
int count = 0;
while(flag)
{
test = test - static_cast<int>(test);
if(test <= 0.00001)
{
flag = false;
}
test *= 10;
count++;
}
The solution I found was to cast only once at the beginning, as rounding may produce a negative and terminate prematurely, and to round thenceforth. The interesting thing is that both trunc and floor also had this issue, seemingly turning what should be a 2 into a 1.
My Professor and I were both quite stumped as I fully expected small numbers to appear (most were in the 10^-10 range), but was not expecting that casting, truncing, and flooring would all also fail.

It is important to understand that not all rational numbers are representable in finite precision. Also, it is important to understand that set of numbers which are representable in finite precision in decimal base, is different from the set of numbers that are representable in finite precision in binary base. Finally, it is important to understand that your CPU probably represents floating point numbers in binary.
2029.00012 in particular happens to be a number that is not representable in a double precision IEEE 754 floating point (and it indeed is a double precision literal; you may have intended to use long double instead). It so happens that the closest number that is representable is 2029.000119999999924402800388634204864501953125. So, you're counting the significant digits of that number, not the digits of the literal that you used.
If the intention of 0.00001 was to stop counting digits when the number is close to a whole number, it is not sufficient to check whether the value is less than the threshold, but also whether it is greater than 1 - threshold, as the representation error can go either way:
if(test <= 0.00001 || test >= 1 - 0.00001)
After all, you can multiple 0.99999999999999999999999999 with 10 many times until the result becomes close to zero, even though that number is very close to a whole number.

As multiple people have already commented, that won't work because of limitations of floating-point numbers. You had a somewhat correct intuition when you said that you expected "some error" with doubles, but that is ultimately not enough. Running your specific program on my machine, the closest representable double to 2029.00012 is 2029.0001199999999244 (this is actually a truncated value, but it shows the series of 9's well enough). For that reason, when you multiply by 10, you keep finding new significant digits.
Ultimately, the issue is that you are manipulating a base-2 real number like it's a base-10 number. This is actually quite difficult. The most notorious use cases for this are printing and parsing floating-point numbers, and a lot of sweat and blood went into that. For example, it wasn't that long ago that you could trick the official Java implementation into looping endlessly trying to convert a String to a double.
Your best shot might be to just reuse all that hard work. Print to 7 digits of precision, and subtract the number of trailing zeroes from the result:
#include <iostream>
#include <sstream>
#include <iomanip>
#include <string>
int main() {
long double d = 2029.00012;
auto double_string = (std::stringstream() << std::fixed << std::setprecision(7) << d).str();
auto first_decimal_index = double_string.find('.') + 1;
auto last_nonzero_index = double_string.find_last_not_of('0');
if (last_nonzero_index == std::string::npos) {
std::cout << "7 significant digits\n";
} else if (last_nonzero_index < first_decimal_index) {
std::cout << -(first_decimal_index - last_nonzero_index + 1) << " significant digits\n";
} else {
std::cout << (last_nonzero_index - first_decimal_index) << " significant digits\n";
}
}
It feels unsatisfactory, but:
it correctly prints 5;
the "satisfactory" alternative is possibly significantly harder to implement.
It seems to me that your second-best alternative is to read on floating-point printing algorithms and implement just enough of it to get the length of the value that you're going to print, and that's not exactly an introductory-level task. If you decide to go this route, the current state of the art is the Grisu2 algorithm. Grisu2 has the notable benefit that it will always print the shortest base-10 string that will produce the given floating-point value, which is what you seem to be after.

If you want sane results, you can't just truncate the digits, because sometimes the floating point number will be a hair less than the rounded number. If you want to fix this via a fluke, change your initialization to be
long double test = 2029.00012L;
If you want to fix it for real,
bool flag = true;
long double test = 2029.00012;
int count = 0;
while (flag)
{
test = test - static_cast<int>(test + 0.000005);
if (test <= 0.00001)
{
flag = false;
}
test *= 10;
count++;
}
My apologies for butchering your haphazard indent; I can't abide by them. According to one of my CS professors, "ideally, a computer scientist never has to worry about the underlying hardware." I'd guess your CS professor might have similar thoughts.

Count number of ways to make an amount with change given

I was trying to code an algorithm to count the number of different possible ways the make a certain amount with the given denominations.
Assume the US dollar is available in denominations of $100, $50, $20, $10, $5, $1, $0.25, $0.10, $0.05 and $0.01. Below function, works great for int amount and int denominations
/* Count number of ways of making different combination */
int Count_Num_Ways(double amt, int numDenom, double S[]){
cout << amt << endl; //getchar();
/* combination leads to the amount */
if(amt == 0.00)
return 1;
/* No combination can lead to negative sum*/
if(amt < 0.00)
return 0;
/* All denominations have been exhausted and we have not reached
the required sum */
if(numDenom < 0 && amt >= 0.00)
return 0;
/* either we pick this denomination, this causes a reduction of
picked denomination from the sum for further subproblem, or
we choose to not pick this denomination and
try a different denomination */
return Count_Num_Ways(amt, numDenom - 1, S) +
Count_Num_Ways(amt - S[numDenom], numDenom, S);
}
but when I change my logic from int to float, it goes into infinite loop. I suspect that it is because of floating point comparisons in the code. I am not able to figure out the exact cause for a infinite loop behavior.
Any help in this regard would be helpful.

When dealing with such "small" currency amounts and not dealing with interest it will be much easier to just deal with cents and integer amounts only, not using floating point.
So just change your formula to use cents rather than dollars and keep using integers. Then when you need to display the amounts just divide them by 100 to get the dollars and modulo 100 to get the cents.

floating point operations cannot be exact, because of the finite representation. This way you will never ever end up with exactly 0.0. That's why you always test an interval like so:
if (fabs(amt - 0.0) < TOL)
with a given tolerance of TOL. TOL is chosen appropriately for the application, in this case, 1/2 cent should already be fine.
EDIT: Of course, for this kind of thing, Daemin's answer is much more suitable.

Why do simple doubles like 1.82 end up being 1.819999999645634565360? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Why does Visual Studio 2008 tell me .9 - .8999999999999995 = 0.00000000000000055511151231257827?
c++
Hey so i'm making a function to return the number of a digits in a number data type given, but i'm having some trouble with doubles.
I figure out how many digits are in it by multiplying it by like 10 billion and then taking away digits 1 by 1 until the double ends up being 0. however when putting in a double of value say .7904 i never exit the function as it keeps taking away digits which never end up being 0 as the resut of .7904 ends up being 7,903,999,988 and not 7,904,000,000.
How can i solve this problem?? Thanks =) ! oh and any other feed back on my code is WELCOME!
here's the code of my function:
/////////////////////// Numb_Digits() ////////////////////////////////////////////////////
enum{DECIMALS = 10, WHOLE_NUMBS = 20, ALL = 30};
template<typename T>
unsigned long int Numb_Digits(T numb, int scope)
{
unsigned long int length= 0;
switch(scope){
case DECIMALS: numb-= (int)numb; numb*=10000000000; // 10 bil (10 zeros)
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
case WHOLE_NUMBS: numb= (int)numb; numb*=10000000000;
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
case ALL: numb = numb; numb*=10000000000;
for(; numb != 0; length++)
numb-=((int)(numb/pow((double)10, (double)(9-length))))* pow((double)10, (double)(9-length)); break;
default: break;}
return length;
};
int main()
{
double test = 345.6457;
cout << Numb_Digits(test, ALL) << endl;
cout << Numb_Digits(test, DECIMALS) << endl;
cout << Numb_Digits(test, WHOLE_NUMBS) << endl;
return 0;
}

It's because of their binary representation, which is discussed in depth here:
http://en.wikipedia.org/wiki/IEEE_754-2008
Basically, when a number can't be represented as is, an approximation is used instead.
To compare floats for equality, check if their difference is lesser than an arbitrary precision.

The easy summary about floating point arithmetic :
http://floating-point-gui.de/
Read this and you'll see the light.
If you're more on the math side, Goldberg paper is always nice :
http://cr.yp.to/2005-590/goldberg.pdf
Long story short : real numbers are stored with a fixed, irregular precision, leading to non obvious behaviors. This is unrelated to the language but more a design choice of how to handle real numbers as a whole.

This is because C++ (like most other languages) can not store floating point numbers with infinte precision.
Floating points are stored like this:
sign * coefficient * 10^exponent if you're using base 10.
The problem is that both the coefficient and exponent are stored as finite integers.
This is a common problem with storing floating point in computer programs, you usually get a tiny rounding error.
The most common way of dealing with this is:
Store the number as a fraction (x/y)
Use a delta that allows small deviations (if abs(x-y) < delta)
Use a third party library such as GMP that can store floating point with perfect precision.
Regarding your question about counting decimals.
There is no way of dealing with this if you get a double as input. You cannot be sure that the user actually sent 1.819999999645634565360 and not 1.82.
Either you have to change your input or change the way your function works.
More info on floating point can be found here: http://en.wikipedia.org/wiki/Floating_point

This is because of the way the IEEE floating point standard is implemented, which will vary depending on operations. It is an approximation of precision. Never use logic of if(float == float), ever!

Float numbers are represented in the form Significant digits × baseexponent(IEEE 754). In your case, float 1.82 = 1 + 0.5 + 0.25 + 0.0625 + ...
Since only a limited digits could be stored, therefore there will be a round error if the float number cannot be represented as a terminating expansion in the relevant base (base 2 in the case).

You should always check relative differences with floating point numbers, not absolute values.
You need to read this, too.

Computers don't store floating point numbers exactly. To accomplish what you are doing, you could store the original input as a string, and count the number of characters.

C/C++ rounding up decimals with a certain precision, efficiently

I'm trying to optimize the following. The code bellow does this :
If a = 0.775 and I need precision 2 dp then a => 0.78
Basically, if the last digit is 5, it rounds upwards the next digit, otherwise it doesn't.
My problem was that 0.45 doesnt round to 0.5 with 1 decimalpoint, as the value is saved as 0.44999999343.... and setprecision rounds it to 0.4.
Thats why setprecision is forced to be higher setprecision(p+10) and then if it really ends in a 5, add the small amount in order to round up correctly.
Once done, it compares a with string b and returns the result. The problem is, this function is called a few billion times, making the program craw. Any better ideas on how to rewrite / optimize this and what functions in the code are so heavy on the machine?
bool match(double a,string b,int p) { //p = precision no greater than 7dp
double t[] = {0.2, 0.02, 0.002, 0.0002, 0.00002, 0.000002, 0.0000002, 0.00000002};
stringstream buff;
string temp;
buff << setprecision(p+10) << setiosflags(ios_base::fixed) << a; // 10 decimal precision
buff >> temp;
if(temp[temp.size()-10] == '5') a += t[p]; // help to round upwards
ostringstream test;
test << setprecision(p) << setiosflags(ios_base::fixed) << a;
temp = test.str();
if(b.compare(temp) == 0) return true;
return false;
}

I wrote an integer square root subroutine with nothing more than a couple dozen lines of ASM, with no API calls whatsoever - and it still could only do about 50 million SqRoots/second (this was about five years ago ...).
The point I'm making is that if you're going for billions of calls, even today's technology is going to choke.
But if you really want to make an effort to speed it up, remove as many API usages as humanly possible. This may require you to perform API tasks manually, instead of letting the libraries do it for you. Specifically, remove any type of stream operation. Those are slower than dirt in this context. You may really have to improvise there.
The only thing left to do after that is to replace as many lines of C++ as you can with custom ASM - but you'll have to be a perfectionist about it. Make sure you are taking full advantage of every CPU cycle and register - as well as every byte of CPU cache and stack space.
You may consider using integer values instead of floating-points, as these are far more ASM-friendly and much more efficient. You'd have to multiply the number by 10^7 (or 10^p, depending on how you decide to form your logic) to move the decimal all the way over to the right. Then you could safely convert the floating-point into a basic integer.
You'll have to rely on the computer hardware to do the rest.
<--Microsoft Specific-->
I'll also add that C++ identifiers (including static ones, as Donnie DeBoer mentioned) are directly accessible from ASM blocks nested into your C++ code. This makes inline ASM a breeze.
<--End Microsoft Specific-->

Depending on what you want the numbers for, you might want to use fixed point numbers instead of floating point. A quick search turns up this.

I think you can just add 0.005 for precision to hundredths, 0.0005 for thousands, etc. snprintf the result with something like "%1.2f" (hundredths, 1.3f thousandths, etc.) and compare the strings. You should be able to table-ize or parameterize this logic.

You could save some major cycles in your posted code by just making that double t[] static, so that it's not allocating it over and over.

Try this instead:
#include <cmath>
double setprecision(double x, int prec) {
return
ceil( x * pow(10,(double)prec) - .4999999999999)
/ pow(10,(double)prec);
}
It's probably faster. Maybe try inlining it as well, but that might hurt if it doesn't help.
Example of how it works:
2.345* 100 (10 to the 2nd power) = 234.5
234.5 - .4999999999999 = 234.0000000000001
ceil( 234.0000000000001 ) = 235
235 / 100 (10 to the 2nd power) = 2.35
The .4999999999999 was chosen because of the precision for a c++ double on a 32 bit system. If you're on a 64 bit platform you'll probably need more nines. If you increase the nines further on a 32 bit system it overflows and rounds down instead of up, i. e. 234.00000000000001 gets truncated to 234 in a double in (my) 32 bit environment.

Using floating point (an inexact representation) means you've lost some information about the true number. You can't simply "fix" the value stored in the double by adding a fudge value. That might fix certain cases (like .45), but it will break other cases. You'll end up rounding up numbers that should have been rounded down.
Here's a related article:
http://www.theregister.co.uk/2006/08/12/floating_point_approximation/

I'm taking at guess at what you really mean to do. I suspect you're trying to see if a string contains a decimal representation of a double to some precision. Perhaps it's an arithmetic quiz program and you're trying to see if the user's response is "close enough" to the real answer. If that's the case, then it may be simpler to convert the string to a double and see if the absolute value of the difference between the two doubles is within some tolerance.
double string_to_double(const std::string &s)
{
std::stringstream buffer(s);
double d = 0.0;
buffer >> d;
return d;
}
bool match(const std::string &guess, double answer, int precision)
{
const static double thresh[] = { 0.5, 0.05, 0.005, 0.0005, /* etc. */ };
const double g = string_to_double(guess);
const double delta = g - answer;
return -thresh[precision] < delta && delta <= thresh[precision];
}
Another possibility is to round the answer first (while it's still numeric) BEFORE converting it to a string.
bool match2(const std::string &guess, double answer, int precision)
{
const static double thresh[] = {0.5, 0.05, 0.005, 0.0005, /* etc. */ };
const double rounded = answer + thresh[precision];
std::stringstream buffer;
buffer << std::setprecision(precision) << rounded;
return guess == buffer.str();
}
Both of these solutions should be faster than your sample code, but I'm not sure if they do what you really want.

As far as i see you are checking if a rounded on p points is equal b.
Insted of changing a to string, make other way and change string to double
- (just multiplications and addion or only additoins using small table)
- then substract both numbers and check if substraction is in proper range (if p==1 => abs(p-a) < 0.05)

Old time developers trick from the dark ages of Pounds, Shilling and pence in the old country.
The trick was to store the value as a whole number fo half-pennys. (Or whatever your smallest unit is). Then all your subsequent arithmatic is straightforward integer arithimatic and rounding etc will take care of itself.
So in your case you store your data in units of 200ths of whatever you are counting,
do simple integer calculations on these values and divide by 200 into a float varaible whenever you want to display the result.
I beleive Boost does a "BigDecimal" library these days, but, your requirement for run time speed would probably exclude this otherwise excellent solution.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js