Doubles are contaminating my BigDecimal math

Doubles are contaminating my BigDecimal math - clojure

I'm writing a Mandelbrot Set explorer. I need as much precision as possible so I can zoom in as far as possible.
I noticed an unfortunate side-effect of mixing doubles and BigDecimals: they "contaminate" the type returned:
(type (* 1M 2))
=> java.math.BigDecimal
(type (* 1M 2.0))
=> java.lang.Double
I expected the opposite. BigDecimals, being potentially more precise, should contaminate the doubles.
Besides manually calling bigdec on every number that may come in contact with a BigDecimal, is there a way of preventing the auto-downgrade to double when doing math on doubles and BigDecimals?

Once you introduce a double into the equation, you limit the amount of precision you can possibly have. A BigDecimal accurate to within a million decimal places is no use to you, if the way you got it involved multiplying by something with just 15 or so significant digits. You could promote the result to a BigDecimal, but you've lost a ton of precision whether you like it or not. Therefore, Clojure's promotion rules make that obvious for you, by giving back a double instead of a high-precision BigDecimal.
See, for example, BigDecimal's JavaDoc for an explanation of why it is a bad idea to convert doubles to BigDecimals, implicitly or explicitly.

This isn't actually a bug, even though it at least looks wrong. To more clearly show how this leads to wrong looking answers compare these expressions:
user> (* 2.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001M 1.0)
2.0
user> (* 2.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001M 1.0M)
2.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010M
For the time being you will likely need to, as you suggest, make sure you only use big decimals in your program. It will likely be limited to the IO functions and any constants you introduce need the M on the end. Adding preconditions to functions will likely help catch some cases as well.

Related

Convert any number in clojure to floating point number

So I have this problem: I am making calculations with each element in a vector. Depending on the numbers in the vector the resulting vector may contain rationals, floats, scientific floats, big integers, integers. I need to convert all of them to floating point numbers or rounded integers. The resulting text is an SVG that will be sent to the client and those numbers are part of a path. Can I do it with something built in or should I roll my own function?
The problems with format are that it is a thin wrapper around the java Formatter class. This means that big ints are not handled since clojure has its own implementation.
On the other hand cl-format (which should be the primary choice actually) formats everything well except for rationals - 4/5 is converted to 4/5.0. Maybe I am doing something wrong with cl-format.
I tried type hinting the whole vectors (which will be needed anyway) as doubles, but the results keep returning as rationals.
Please help if I am missing something.

It is not the job of a formatter to convert your datatypes, you should ensure the data has the type you want to print before handing it to the formatter.
Type hinting is not a type conversion or coercion, it is a hint to the compiler about what args are most likely to come in at runtime.
The function you want is double.
user=> (double 42)
42.0
user=> (map double [1.0 1 1.1M 2/3])
(1.0 1.0 1.1 0.6666666666666667)
Also, if you are type hinting the entire vector, you should not be using a vector, you should be using double-array. Vectors are not specialized in any way by the type of their contents.

How can I get consistent program behavior when using floats?

I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though addition is commutative so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.

I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.

Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range you and you don't need divisions you may be able to scale the problem and use exact operations on integers. In general, the conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be use to compute an interval rather than an individual value (possibly with 3 values resulting from round up, round down, and round closest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possible multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
Most cases I'm aware of experiment briefly with some of the methods mentioned above and conclude that the model is imprecise anyway and don't bother with the effort. Also, doing something else than using float may yield better result but is just too slow, even using double due to the doubled memory footprint and the smaller opportunity of using SIMD operations.

I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which that work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single precision floating point values as the exponent is 8 bits wide. Some exponents will be the result of exceptions but for simplicity it is better to have a 256 member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon in the case where the exponent is 0 i e the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base 2 adapted i e a value that can be exactly represented in a single precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4 and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.

If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles, there are many techniques out there and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (saw a great guide on this, if I find it I'll link it).

Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point number for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it wont make that set smaller. If your model itself is chaotic, then it's chaotic.

Two ints to one double C++

I am having a bit of a problem here.
I have two int values, one for dollars and one for cents. My job is to combine them into one double value and I am having some trouble.
Here's an example of what I want to be able to do:
int dollars = 10
int cents = 50
<some code which I haven't figured out yet>
double total = 10.50
I want to think it is relatively simple, but I'm having a hard time figuring it out.
Thanks for the help!

Start by thinking how you would solve this as a simple arithmetic problem, with pencil and paper (nothing to do with C). Once you find a way to do it manually, I'm sure the way to program it will seem trivial.

How about double total = double(dollars) + double(cents) / 100.0;?
Note that double is not a good data type to represent 10-based currencies, due to its inability to represent 1/100 precisely. Consider a fixed-point solution instead, or perhaps a decimal float (those are rare).

That's not difficult... you have to convert dollars to a double1 and add cents multiplied for 0.01 (or divided by 100. - notice the trailing dot, that's to indicate that 100. is a double constant, so / will perform a floating-point division instead of an integer division).
... but be aware of the fact that storing monetary values in binary floating-point variables is not a good idea at all, because binary doesn't have a finite representation of many "exact" decimal amounts (e.g. 0.1), that will be stored in an approximate representation. Working with such values may yield "strange" results when you start to do some arithmetic with them.
Actually, depending on your expression, it's probably not necessary due to implicit casts.

If you're interested in 'the whole idea' of programming and not only in getting your homework right, I suggest you think about this: "Is there any way I can represent a whole dollar as a certain amount of cents?" Why should you ask this? Because if you want to represent two different 'types' of certain values as one value, you need to 'normalize' them or 'standardize' them in a way so that there is not any data loss or corruption (or at least for the smaller problems).
Also I agree with Kerrek SB, representing money as double might not be the best solution.

Isn't it just as easy: total = dollars + (cents/100); ?
No reason to complicate this.

Preventing Rounding Errors

I was just reading about rounding errors in C++. So, if I'm making a math intense program (or any important calculations) should I just drop floats all together and use only doubles or is there an easier way to prevent rounding errors?

Obligatory lecture: What Every Programmer Should Know About Floating-Point Arithmetic.
Also, try reading IEEE Floating Point standard.
You'll always get rounding errors. Unless you use an infinite arbitrary precision library, like gmplib. You have to decide if your application really needs this kind of effort.
Or, you could use integer arithmetic, converting to floats only when needed. This is still hard to do, you have to decide if it's worth it.
Lastly, you can use float or double taking care not to make assumption about values at the limit of representation's precision. I'd wish this Valgrind plugin was implemented (grep for float)...

The rounding errors are normally very insignificant, even using floats. Mathematically-intense programs like games, which do very large numbers of floating-point computations, often still use single-precision.

This might work if your highest number is less than 10 billion and you're using C++ double precision.
if ( ceil(10000*(x + 0.00001)) > ceil(100000*(x - 0.00001))) {
x = ceil(10000*(x + 0.00004)) / 10000;
}
This should allow at least the last digit to be off +/- 9. I'm assuming dividing by 1000 will always just move a decimal place. If not, then maybe it could be done in binary.
You would have to apply it after every operation that is not +, -, *, or a comparison. For example, you can't do two divisions in the same formula because you'd have to apply it to each division.
If that doesn't work, you could work in integers by scaling the numbers up and always use integer division. If you need advanced functions maybe there is a package that does deterministic integer math. Integer division is required in a lot of financial settings because of round off error being subject to exploit like in the movie "The Office".

Need pow(-1,1.2) to be 1

I am using math.h with GCC and GSL. I was wondering how to get this to evaluate?
I was hoping that the pow function would recognize pow(-1,1.2) as ((-1)^6)^(1/5). But it doesn't.
Does anybody know of a c++ library that will recognize these? Perhaps somebody has a decomposition routine they could share.

Mathematically, pow(-1, 1.2) is simply not defined. There are no powers with fractional exponents of negative numbers, and I hope there is no library that will simply return some arbitray value for such an expression. Would you also expect things like
pow(-1, 0.5) = ((-1)^2)^(1/4) = 1
which obviously isn't desirable.
Moreover, the floating point number 1.2 isn't even exactly equal to 6/5. The closest double precision number to 1.2 is
1.1999999999999999555910790149937383830547332763671875
Given this, what result would you expect now for pow(-1, 1.2)?

If you want to raise negative numbers to powers -- especially fractional powers -- use the cpow() method. You'll need to include <complex> to use it.

It seems like you're looking for pow(abs(x), y).
Explanation: you seem to be thinking in terms of
xy = (xN)(y/N)
If we choose that N === 2, then you have
(x2)y/2 = ((x2)1/2)y
But
(x2)1/2 = |x|
Substituting gives
|x|y
This is a stretch, because the above manipulations only work for non-negative x, but you're the one who chose to use that assumption.

Sounds like you want to perform a complex power (cpow()) and then take the magnitude (abs()) of that after.
>>> abs(cmath.exp(1.2*cmath.log(-1)))
1.0
>>> abs(cmath.exp(1.2*cmath.log(-293.2834)))
913.57662451612202

pow(a,b) is often thought of, defined as, and implemented as exp(log(a)*b) where log(a) is natural logarithm of a. log(a) is not defined for a<=0 in real numbers. So you need to either write a function with special case for negative a and integer b and/or b=1/(some_integer). It's easy to special-case for integer b, but for b=1/(some_integer) it's prone to round-off problems, like Sven Marnach pointed out.
Maybe for your domain pow(-a,b) should always be -pow(a,b)? But then you'd just implement such function, so I assume the question warrants more explanation .
Like duskwuff suggested, a much more robust and "mathematical" solution is to use complex functions log and exp, but it's much more "complex" (excuse my pun) than it seems on the surface (even though there's cpow function). And it'll be much slower if you have to compute a lot of pow()s.
Now there's an important catch with complex numbers that may or may not be relevant to your problem domain: when done right, the result of pow(a,b) is not one, but often a few complex numbers, but in the cases you care about, one of them will be complex number with nearly-zero imaginary part (it'll be non-zero due to roundoff errors) which you can simply ignore and/or not compute in your code.
To demonstrate it, consider what pow(-1,.5) is. It's a number X such that X^2==-1. Guess what? There are 2 such numbers: i and -i. Generally, pow(-1, 1/N) has exactly N solutions, although you're interested in only one of them.
If the imaginary part of all results of pow(a,b) is significant, it means you are passing wrong values. For single-precision floating point values in the range you describe, 1e-6*max(abs(a),abs(b)) would be a good starting point for defining the "significant enough" threshold. The extreme "wrong values" would be pow(-1,0.5) which would return 0 + 1i (0 in real part, 1 in imaginary part). Here the imaginary part is huge relative to the input and real part, so you know you screwed up your input values.
In any reasonable single-return-result implementation of cpow() , cpow(-1,0.3333) will probably return something like -1+0.000001i and ignore two other values with significant imaginary parts. So you can just take that real value and that's your answer.

Use std::complex. Without that, the roots of unity don't make much sense. With it they make a whole lot of sense.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js