I was just reading about rounding errors in C++. So, if I'm making a math intense program (or any important calculations) should I just drop floats all together and use only doubles or is there an easier way to prevent rounding errors?
Obligatory lecture: What Every Programmer Should Know About Floating-Point Arithmetic.
Also, try reading IEEE Floating Point standard.
You'll always get rounding errors. Unless you use an infinite arbitrary precision library, like gmplib. You have to decide if your application really needs this kind of effort.
Or, you could use integer arithmetic, converting to floats only when needed. This is still hard to do, you have to decide if it's worth it.
Lastly, you can use float or double taking care not to make assumption about values at the limit of representation's precision. I'd wish this Valgrind plugin was implemented (grep for float)...
The rounding errors are normally very insignificant, even using floats. Mathematically-intense programs like games, which do very large numbers of floating-point computations, often still use single-precision.
This might work if your highest number is less than 10 billion and you're using C++ double precision.
if ( ceil(10000*(x + 0.00001)) > ceil(100000*(x - 0.00001))) {
x = ceil(10000*(x + 0.00004)) / 10000;
}
This should allow at least the last digit to be off +/- 9. I'm assuming dividing by 1000 will always just move a decimal place. If not, then maybe it could be done in binary.
You would have to apply it after every operation that is not +, -, *, or a comparison. For example, you can't do two divisions in the same formula because you'd have to apply it to each division.
If that doesn't work, you could work in integers by scaling the numbers up and always use integer division. If you need advanced functions maybe there is a package that does deterministic integer math. Integer division is required in a lot of financial settings because of round off error being subject to exploit like in the movie "The Office".
Related
I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal points. But I can't do it when the third decimal point is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, it means some usual round(), rint(), etc. after multiplication by 1,000,000. With this step, you are getting exact decimal number. After this, you are able to make the required final rounding to what you need.
In your example, this will round 1,004,999.99... to 1,005,000. Then, divide by 10000 and round again.
(Notice that there are suggestions to make this rounding in yet specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest this rounding is done in the way that exact fractional part 0.5 shall be rounded away from zero unless least significant result bit becomes 0 or 5, in that case it is rounded toward zero. But, if you have no such rounding available, a general away-from-zero is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point at all and use fixed-point arithmetic (emulated with integers, if needed). This is better because you the methods I've described for rounding are inevitably containing conversion to integers (and back), so, it's cheaper to use such integers directly. You will get inexact operation checking as well (by cost of explicit integer overflow).
If you can use a library like boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.
Questions like "Why isn't 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1 = 0.8?" got me thinking that...
... It would probably be nice to have the compiler warn about the floating-point constants that it rounds to the nearest representable in the binary floating-point type (e.g. 0.1 and 0.8 are rounded in radix-2 floating-point, otherwise they'd need an infinite amount of space to store the infinite number of digits).
I've looked up gcc warnings and so far found none for this purpose (-Wall, -Wextra, -Wfloat-equal, -Wconversion, -Wcoercion (unsupported or C only?), -Wtraditional (C only) don't appear to be doing what I want).
I haven't found such a warning in Microsoft Visual C++ compiler either.
Am I missing a hidden or rarely-used option?
Is there any compiler at all that has this kind of warning?
EDIT: This warning could be useful for educational purposes and serve as a reminder to those new to floating-point.
There is no technical reason the compiler could not issue such warnings. However, they would be useful only for students (who ought to be taught how floating-point arithmetic works before they start doing any serious work with it) and people who do very fine work with floating-point. Unfortunately, most floating-point work is rough; people throw numbers at the computer without much regard for how the computer works, and they accept whatever results they get.
The warning would have to be off by default to support the bulk of existing floating-point code. Were it available, I would turn it on for my code in the Mac OS X math library. Certainly there are points in the library where we depend on every bit of the floating-point value, such as places where we use extended-precision arithmetic, and values are represented across more than one floating-point object (e.g., we would have one object with the high bits of 1/π, another object with 1/π minus the first object, and a third object with 1/π minus the first two objects, giving us about 150 bits of 1/π). Some such values are represented in hexadecimal floating-point in the source text, to avoid any issues with compiler conversion of decimal numerals, and we could readily convert any remaining numerals to avoid the new compiler warning.
However, I doubt we could convince the compiler developers that enough people would use this warning or that it would catch enough bugs to make it worth their time. Consider the case of libm. Suppose we generally wrote exact numerals for all constants but, on one occasion, wrote some other numeral. Would this warning catch a bug? Well, what bug is there? Most likely, the numeral is converted to exactly the value we wanted anyway. When writing code with this warning turned on, we are likely thinking about how the floating-point calculations will be performed, and the value we have written is one that is suitable for our purpose. E.g., it may be a coefficient of some minimax polynomial we calculated, and the coefficient is as good as it is going to get, whether represented approximately in decimal or converted to some exactly-representable hexadecimal floating-point numeral.
So, this warning will rarely catch bugs. Perhaps it would catch an occasion where we mistyped a numeral, accidentally inserting an extra digit into a hexadecimal floating-point numeral, causing it to extend beyond the representable significand. But that is rare. In most cases, the numerals we use are either simple and short or are copied and pasted from software that has calculated them. On some occasions, we will hand-type special values, such as 0x1.fffffffffffffp0. A warning when an extra “f” slips into that numeral might catch a bug during compilation, but that error would almost certainly be caught quickly in testing, since it drastically alters the special value.
So, such a compiler warning has little utility: Very few people will use it, and it will catch very few bugs for the people who do use it.
The warning is in the source: when you write float, double, or long double including any of their respective literals. Obviously, some literals are exact but even this doesn't help much: the sum of two exact values may inexact, e.g., if the have rather different scales. Having the compiler warn about inexact floating point constants would generate a false sense of security. Also, what are you meant to do about rounded constants? Writing the exact closest value explicitly would be error prone and obfuscate the intent. Writing them differently, e.g., writing 1.0 / 10.0 instead of 0.1 also obfuscates the intent and could yield different values.
There will be no such compiler switch and the reason is obvious.
We are writing down the binary components in decimal:
First fractional bit is 0.5
Second fractional bit is 0.25
Third fractional bit is 0.125
....
Do you see it ? Due to the odd endings with the number 5 every bit needs
another decimal to represent it exactly. One bit needs one decimal, two bits
needs two decimals and so on.
So for fractional floating points it would mean that for most decimal numbers
you need 24(!) decimal digits for single precision floats and
53(!!) decimal digits for double precision.
Worse, the exact digits carry no extra information, they are pure artifacts
caused by the base change.
Noone is going to write down 3.141592653589793115997963468544185161590576171875
for pi to avoid a compiler warning.
I don't see how a compiler would know or that the compiler can warn you about something like that. It is only a coincidence that a number can be exactly represented by something that is inherently inaccurate.
Since there are two ways of implementing an AP fractional number, one is to emulate the storage and behavior of the double data type, only with more bytes, and the other is to use an existing integer APA implementation for representing a fractional number as a rational i.e. as a pair of integers, numerator and denominator, which of the two ways are more likely to deliver efficient arithmetic in terms of performance? (Memory usage is really of minor concern.)
I'm aware of the existing C/C++ libraries, some of which offer fractional APA with "floats" and other with rationals (none of them features fixed-point APA, however) and of course I could benchmark a library that relies on "float" implementation against one that makes use of rational implementation, but the results would largely depend on implementation details of those particular libraries I would have to choose randomly from the nearly ten available ones. So it's more theoretical pros and cons of the two approaches that I'm interested in (or three if take into consideration fixed-point APA).
The question is what you mean by arbitrary precision that you mention in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?
In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).
Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid over integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data, say, by 2^64 and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as long as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
Note also that in many practical applications multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that often it is unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to efficiency of fixed-point arithmetic.
Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).
So, again, why are you considering floating-point and rational representations only. Is there something that prevents you from considering fixed-point representation?
Since no one else seemed to mention this, rationals and floats represent different sets of numbers. The value 1/3 can be represented precisely with a rational, but not a float. Even an arbitrary precision float would take infinitely many mantissa bits to represent a repeating decimal like 1/3. This is because a float is effectively like a rational but where the denominator is constrained to be a power of 2. An arbitrary precision rational can represent everything that an arbitrary precision float can and more, because the denominator can be any integer instead of just powers of 2. (That is, unless I've horribly misunderstood how arbitrary precision floats are implemented.)
This is in response to your prompt for theoretical pros and cons.
I know you didn't ask about memory usage, but here's a theoretical comparison in case anyone else is interested. Rationals, as mentioned above, specialize in numbers that can be represented simply in fractional notation, like 1/3 or 492113/203233, and floats specialize in numbers that are simple to represent in scientific notation with powers of 2, like 5*2^45 or 91537*2^203233. The amount of ascii typing needed to represent the numbers in their respective human-readable form is proportional to their memory usage.
Please correct me in the comments if I've gotten any of this wrong.
Either way, you'll need multiplication of arbitrary size integers. This will be the dominant factor in your performance since its complexity is worse than O(n*log(n)). Things like aligning operands, and adding or subtracting large integers is O(n), so we'll neglect those.
For simple addition and subtraction, you need no multiplications for floats* and 3 multiplications for rationals. Floats win hands down.
For multiplication, you need one multiplication for floats and 2 multiplications for rational numbers. Floats have the edge.
Division is a little bit more complex, and rationals might win out here, but it's by no means a certainty. I'd say it's a draw.
So overall, IMHO, the fact that addition is at least O(n*log(n)) for rationals and O(n) for floats clearly gives the win to a floating-point representation.
*It is possible that you might need one multiplication to perform addition if your exponent base and your digit base are different. Otherwise, if you use a power of 2 as your base, then aligning the operands takes a bit shift. If you don't use a power of two, then you may also have to do a multiplication by a single digit, which is also an O(n) operation.
You are effectively asking the question: "I need to participate in a race with my chosen animal. Should I choose a turtle or a snail ?".
The first proposal "emulating double" sounds like staggered precision: using an array of doubles of which the sum is the defined number. There is a paper from Douglas M. Priest "Algorithms for Arbitrary Precision Floating Point Arithmetic" which describes how to implement this arithmetic. I implemented this and my experience is very bad: The necessary overhead to make this run drops the performance 100-1000 times !
The other method of using fractionals has severe disadvantages, too: You need to implement gcd and kgv and unfortunately every prime in your numerator or denominator has a good chance to blow up your numbers and kill your performance.
So from my experience they are the worst choices one can made for performance.
I recommend the use of the MPFR library which is one of the fastest AP packages in C and C++.
Rational numbers don't give arbitrary precision, but rather the exact answer. They are, however, more expensive in terms of storage and certain operations with them become costly and some operations are not allowed at all, e.g. taking square roots, since they do not necessarily yield a rational answer.
Personally, I think in your case AP floats would be more appropriate.
I want to divide two ull variables and get the most accurate outcome.
what is the best way to do that?
i.e. 5000034 / 5000000 = 1.0000068
If you want "most accurate precision" - you should avoid floating point arithmetics.
You might want to use some big decimal library [whcih usually implements fixed point arithmetic], and will allow you to define the precision you are seeking.
You should avoid floating point arithmetic because thet are not exact [you have finite number of bits to represent infinite number of numbers in every range, so some slicing must occure...]. Fixed point arithmetic [as usually implemented in big decimal libraries] allows you to allocate more bits "on the fly" to represent the number in the desired accuracy.
More info on the floating point issue can be found in this [a bit advanced] article: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Instead of (double)(N) / D, do 1 + ( (double)(N - D) / D)
I'm afraid that “the most accurate outcome” doesn't mean
much. No finite representation can represent all real numbers exactly;
how precise the representation can be depends on the size of the type
and its internal representation. On most implementations, double will
give about 17 decimal digits precision, which is usually several orders
more precise than the input; for a single multiplicatio or division,
double is usually fine. (Problems occur with addition and subtraction
when the difference between the two values is extreme.) There exist
packages which offer larger precision (BigDecimal, BigFloat and the
like), but they are never exact: in the end, the precision is limited by
the amount of memory you're willing to let them use. They're also much
slower than double, and generally (slightly) more difficult to use
correctly (since they have more options, e.g. just how much precision do
you want). The only real answer to your question is another question:
how much precision do you need? And for what sequence of operations?
Rounding errors accumulate, so while double may be largely sufficient
for a single division, it may cause problems if used naïvely for
iterative procedures. Although in such cases, the solution isn't
usually to increase the precision, but to change the algorithm in a way
to avoid the problems. If double gives you the precision you need,
use it in preference to any extended type. If it doesn't, and you don't
have a choice, then choose one of the existing arbitrary precision
libraries, such as GMP.
(You might also have an issue with the way rounding is handled. For
bookkeeping purposes, for example, most jurisdictions have very strict
laws concerning how to round monitary values, and their rules are based
on decimal arithmetic. In such cases, you'll need a numeric type which
does decimal arithmetic in order for the rounding to conform in all
cases.)
Floating point numbers are probably most accurate for multiplication and division, while integers and fixed point numbers are the best choice for addition and subtraction. This follows from the fact that multiplication and division changes the order of magnitude which floating point numbers handle better, while addition and subtraction is some kind of step, which integers and fixed point numbers handle better.
If you want the best accuracy when dividing integers, implement a RationalNumber class containing the numerator and denominator. This way your reslut will always be exact if you avoid arithmetic overflow. This requires that you accept output in fractional form.
I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though addition is commutative so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range you and you don't need divisions you may be able to scale the problem and use exact operations on integers. In general, the conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be use to compute an interval rather than an individual value (possibly with 3 values resulting from round up, round down, and round closest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possible multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
Most cases I'm aware of experiment briefly with some of the methods mentioned above and conclude that the model is imprecise anyway and don't bother with the effort. Also, doing something else than using float may yield better result but is just too slow, even using double due to the doubled memory footprint and the smaller opportunity of using SIMD operations.
I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which that work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single precision floating point values as the exponent is 8 bits wide. Some exponents will be the result of exceptions but for simplicity it is better to have a 256 member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon in the case where the exponent is 0 i e the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base 2 adapted i e a value that can be exactly represented in a single precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4 and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles, there are many techniques out there and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (saw a great guide on this, if I find it I'll link it).
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point number for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it wont make that set smaller. If your model itself is chaotic, then it's chaotic.