Floats vs rationals in arbitrary precision fractional arithmetic (C/C++)

There are two ways of implementing an AP fractional number: one is to emulate the storage and behavior of the double data type, only with more bytes, and the other is to use an existing integer APA implementation to represent a fractional number as a rational, i.e. as a pair of integers, numerator and denominator. Which of the two is more likely to deliver efficient arithmetic in terms of performance? (Memory usage is really of minor concern.)
I'm aware of the existing C/C++ libraries, some of which offer fractional APA with "floats" and others with rationals (none of them features fixed-point APA, however), and of course I could benchmark a library that relies on a "float" implementation against one that makes use of a rational implementation, but the results would largely depend on implementation details of those particular libraries, which I would have to choose more or less at random from the nearly ten available ones. So it's the theoretical pros and cons of the two approaches that I'm interested in (or of three, if we take fixed-point APA into consideration).

The question is what you mean by the arbitrary precision mentioned in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?
In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).
Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid over integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data by, say, 2^64 and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as far as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
Note also that in many practical applications multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that often it is unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to efficiency of fixed-point arithmetic.
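To make this concrete, here is a minimal sketch of the idea. To stay self-contained it uses int64_t with a 2^32 scale and the __int128 GCC/Clang extension for the wide intermediates, standing in for the big-integer type and 2^64 scale the answer talks about; all names are made up for illustration.

    #include <cstdint>
    #include <cstdio>

    using fix64 = int64_t;                        // value represented = mantissa / 2^32
    constexpr int   kFracBits = 32;
    constexpr fix64 kOne = fix64(1) << kFracBits;

    fix64  from_double(double d) { return fix64(d * kOne); }
    double to_double(fix64 f)    { return double(f) / kOne; }

    // Addition and subtraction need no adjustment: both operands carry the same scale.
    fix64 fx_add(fix64 a, fix64 b) { return a + b; }

    // Multiplication carries the scale twice, so the wide product is shifted back down.
    fix64 fx_mul(fix64 a, fix64 b) { return fix64(((__int128)a * b) >> kFracBits); }

    // Division cancels the scale, so the numerator is shifted up first.
    fix64 fx_div(fix64 a, fix64 b) { return fix64(((__int128)a << kFracBits) / b); }

    // The x = y * a / b pattern mentioned below: the two adjustments compensate,
    // so one wide multiply and one divide suffice, with no extra shifting at all.
    fix64 fx_mul_div(fix64 y, fix64 a, fix64 b) { return fix64((__int128)y * a / b); }

    int main() {
        fix64 x = from_double(1.25), y = from_double(0.75);
        std::printf("%f %f %f %f\n", to_double(fx_add(x, y)), to_double(fx_mul(x, y)),
                    to_double(fx_div(x, y)), to_double(fx_mul_div(x, y, from_double(3.0))));
    }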
Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).
So, again, why are you considering floating-point and rational representations only? Is there something that prevents you from considering fixed-point representation?

Since no one else seemed to mention this, rationals and floats represent different sets of numbers. The value 1/3 can be represented precisely with a rational, but not a float. Even an arbitrary precision float would take infinitely many mantissa bits to represent a repeating decimal like 1/3. This is because a float is effectively like a rational but where the denominator is constrained to be a power of 2. An arbitrary precision rational can represent everything that an arbitrary precision float can and more, because the denominator can be any integer instead of just powers of 2. (That is, unless I've horribly misunderstood how arbitrary precision floats are implemented.)
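A quick way to see this claim in practice (the Rational struct below is just a toy pair of machine integers, not a library type): the double nearest to 1/3 is a dyadic rational, so printing extra digits exposes the error, while a numerator/denominator pair holds 1/3 exactly.

    #include <cstdint>
    #include <cstdio>

    struct Rational { int64_t num, den; };   // toy exact representation

    int main() {
        double third = 1.0 / 3.0;
        std::printf("%.20f\n", third);       // 0.33333333333333331483... -- not exactly 1/3
        Rational r{1, 3};                    // exactly one third
        std::printf("%lld/%lld\n", (long long)r.num, (long long)r.den);
    }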
This is in response to your prompt for theoretical pros and cons.
I know you didn't ask about memory usage, but here's a theoretical comparison in case anyone else is interested. Rationals, as mentioned above, specialize in numbers that can be represented simply in fractional notation, like 1/3 or 492113/203233, and floats specialize in numbers that are simple to represent in scientific notation with powers of 2, like 5*2^45 or 91537*2^203233. The amount of ASCII typing needed to represent the numbers in their respective human-readable forms is proportional to their memory usage.
Please correct me in the comments if I've gotten any of this wrong.

Either way, you'll need multiplication of arbitrary-size integers. This will be the dominant factor in your performance, since its complexity is worse than O(n*log(n)). Things like aligning operands, and adding or subtracting large integers, are O(n), so we'll neglect those.
For simple addition and subtraction, you need no multiplications for floats* and 3 multiplications for rationals. Floats win hands down.
For multiplication, you need one multiplication for floats and 2 multiplications for rational numbers. Floats have the edge.
Division is a little bit more complex, and rationals might win out here, but it's by no means a certainty. I'd say it's a draw.
So overall, IMHO, the fact that addition is at least O(n*log(n)) for rationals and O(n) for floats clearly gives the win to a floating-point representation.
*It is possible that you might need one multiplication to perform addition if your exponent base and your digit base are different. Otherwise, if you use a power of 2 as your base, then aligning the operands takes a bit shift. If you don't use a power of two, then you may also have to do a multiplication by a single digit, which is also an O(n) operation.
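To make the operation counts above concrete, here is a sketch with toy types in which plain int64_t stands in for an arbitrary-precision integer; a real AP library would perform the same steps on big numbers, where each multiplication dominates the cost.

    #include <cstdint>
    #include <cstdio>
    #include <utility>

    struct Rat { int64_t num, den; };        // num/den
    struct Flt { int64_t mant; int exp; };   // mant * 2^exp

    // Rational addition: a/b + c/d = (a*d + c*b)/(b*d) -- three big multiplications
    // (plus a gcd reduction in a real library, which costs even more).
    Rat add(Rat x, Rat y) { return { x.num * y.den + y.num * x.den, x.den * y.den }; }

    // Float addition: align to the common scale with a shift (O(n) when the digit
    // base is a power of two), then one big addition -- no multiplications at all.
    Flt add(Flt x, Flt y) {
        if (x.exp < y.exp) std::swap(x, y);
        return { (x.mant << (x.exp - y.exp)) + y.mant, y.exp };   // a big mantissa just grows here
    }

    int main() {
        Rat r = add(Rat{1, 3}, Rat{1, 6});   // 1/3 + 1/6 = 9/18 (unreduced)
        Flt f = add(Flt{3, 2}, Flt{5, 0});   // 3*2^2 + 5*2^0 = 17*2^0
        std::printf("%lld/%lld and %lld*2^%d\n", (long long)r.num, (long long)r.den,
                    (long long)f.mant, f.exp);
    }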

You are effectively asking the question: "I need to participate in a race with my chosen animal. Should I choose a turtle or a snail ?".
The first proposal, "emulating double", sounds like staggered precision: using an array of doubles whose sum is the represented number. There is a paper by Douglas M. Priest, "Algorithms for Arbitrary Precision Floating Point Arithmetic", which describes how to implement this arithmetic. I implemented this and my experience is very bad: the overhead necessary to make this run drops the performance 100-1000 times!
The other method, using fractions, has severe disadvantages too: you need to implement gcd and lcm, and unfortunately every prime in your numerator or denominator has a good chance to blow up your numbers and kill your performance.
So from my experience these are the worst choices one can make for performance.
I recommend the use of the MPFR library which is one of the fastest AP packages in C and C++.

Rational numbers don't give arbitrary precision, but rather the exact answer. They are, however, more expensive in terms of storage, certain operations with them become costly, and some operations are not possible at all, e.g. taking square roots, since they do not necessarily yield a rational answer.
Personally, I think in your case AP floats would be more appropriate.

Related

How does printf extract digits from a floating point number?

How do functions such as printf extract digits from a floating point number? I understand how this could be done in principle. Given a number x, of which you want the first n digits, scale x by a power of 10 so that x is between pow(10, n) and pow(10, n-1). Then convert x into an integer, and take the digits of the integer.
I tried this, and it worked. Sort of. My answer was identical to the answer given by printf for the first 16 decimal digits, but tended to differ on the ones after that. How does printf do it?
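For reference, here is a sketch of that naive approach (assuming x > 0 and n small enough that the leading digits fit in 64 bits). On a typical IEEE-754 system it reproduces printf's digits up to roughly the 16th and then starts to drift, exactly as described, because the scaling multiplication itself rounds.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    void naive_digits(double x, int n) {
        int e = (int)std::floor(std::log10(x)) + 1;             // digits before the decimal point
        double scaled = x * std::pow(10.0, n - e);               // now in [10^(n-1), 10^n)
        unsigned long long digits = (unsigned long long)scaled;  // truncate to an integer
        std::printf("%llu (decimal point after %d digits)\n", digits, e);
    }

    int main() {
        naive_digits(3.141592653589793, 17);        // last digit or two may already be off
        std::printf("%.17g\n", 3.141592653589793);  // what printf itself says, for comparison
    }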
The classic implementation is David Gay's dtoa. The exact details are somewhat arcane (see Why does "dtoa.c" contain so much code?), but in general it works by doing the base conversion using more precision beyond what you can get from a 32-bit, 64-bit, or even 80-bit floating point number. To do this, it uses so-called "bigints" or arbitrary-precision numbers, which can hold as many digits as you can fit in memory. Gay's code has been copied, with modifications, into countless other libraries including common implementations for the C standard library (so it might power your printf), Java, Python, PHP, JavaScript, etc.
(As a side note... not all of these copies of Gay's dtoa code were kept up to date, so because PHP used an old version of strtod it hung when parsing 2.2250738585072011e-308.)
In general, if you do things the "obvious" and simple way like multiplying by a power of 10 and then converting the integer, you will lose a small amount of precision and some of the results will be inaccurate... but maybe you will get the first 14 or 15 digits correct. Gay's implementation of dtoa() claims to get all the digits correct... but as a result, the code is quite difficult to follow. Skip to the bottom to see strtod itself; you can see that it starts with a "fast path" which just uses ordinary floating-point arithmetic, but then it detects if that result is incorrect and falls back to a more reliable algorithm using bigints which works in all cases (but is slower).
The implementation has the following citation, which you may find interesting:
* Inspired by "How to Print Floating-Point Numbers Accurately" by
* Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126].
The algorithm works by calculating a range of decimal numbers which produce the given binary number, and by using more digits, the range gets smaller and smaller until you either have an exact result or you can correctly round to the requested number of digits.
In particular, from sec 2.2 Algorithm,
The algorithm uses exact rational arithmetic to perform its computations so that there is no loss of accuracy. In order to generate digits, the algorithm scales the number so that it is of the form 0.d1d2..., where d1, d2, ..., are base-B digits. The first digit is computed by multiplying the scaled number by the output base, B, and taking the integer part. The remainder is used to compute the rest of the digits using the same approach.
The algorithm can then continue until it has the exact result (which is always possible, since floating-point numbers are base 2, and 2 is a factor of 10) or until it has as many digits as requested. The paper goes on to prove the algorithm's correctness.
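Here is a sketch of that digit loop, using boost::multiprecision::cpp_int merely as a convenient stand-in for exact big-integer arithmetic (dtoa has its own bigint machinery and far more bookkeeping). It assumes 0 < x < 1 so that only fractional digits need generating; because 2 divides 10, the loop terminates on its own with the exact decimal expansion of the double.

    #include <boost/multiprecision/cpp_int.hpp>
    #include <cmath>
    #include <cstdio>

    using boost::multiprecision::cpp_int;

    void exact_fraction_digits(double x, int count) {   // assumes 0 < x < 1
        int exp;
        double frac = std::frexp(x, &exp);              // x = frac * 2^exp, with exp <= 0 here
        cpp_int num = (long long)std::ldexp(frac, 53);  // 53-bit integer significand
        cpp_int den = cpp_int(1) << (53 - exp);         // so x == num / den exactly
        std::printf("0.");
        for (int i = 0; i < count && num != 0; ++i) {
            num *= 10;                                  // scale by the output base B = 10
            cpp_int digit = num / den;                  // the integer part is the next digit
            num -= digit * den;                         // the remainder drives the rest
            std::printf("%d", static_cast<int>(digit));
        }
        std::printf("\n");
    }

    int main() {
        exact_fraction_digits(0.1, 60);   // prints the exact value of the double nearest 0.1
    }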
Also note that not all implementations of printf are based on Gay's dtoa, this is just a particularly common implementation that's been copied a lot.
There are various ways to convert floating-point numbers to decimal numerals without error (either exactly or with rounding to a desired precision).
One method is to use arithmetic as taught in elementary school. C provides functions to work with floating-point numbers, such as frexp, which separates the fraction (also called the significand, often mistakenly called a mantissa) and the exponent. Given a floating-point number, you could create a large array to store decimal digits in and then compute the digits. Each bit in the fraction part of a floating-point number represents some power of two, as determined by the exponent in the floating-point number. So you can simply put a “1” in an array of digits and then use elementary school arithmetic to multiply or divide it the required number of times. You can do that for each bit and then add all the results, and the sum is the decimal numeral that equals the floating-point number.
Commercial printf implementations will use more sophisticated algorithms. Discussing them is beyond the scope of a Stack Overflow question-and-answer. The seminal paper on this is Correctly Rounded Binary-Decimal and Decimal-Binary Conversions by David M. Gay. (A copy appears to be available here, but that seems to be hosted by a third party; I am not sure how official or durable it is. A web search may turn up other sources.) A more recent paper with an algorithm for converting a binary floating-point number to decimal with the shortest number of digits needed to uniquely distinguish the value is Printing Floating-Point Numbers: An Always Correct Method by Marc Andrysco, Ranjit Jhala, and Sorin Lerner.
One key to how it is done is that printf will not just use the floating-point format and its operations to do the work. It will use some form of extended-precision arithmetic, either by working with parts of the floating-point number in an integer format with more bits, by separating the floating-point number into pieces and using multiple floating-point numbers to work with it, or by using a floating-point format with more precision.
Note that the first step in your question, multiplying x by a power of ten, already has two rounding errors. First, not all powers of ten are exactly representable in binary floating-point, so just producing such a power of ten necessarily has some representation error. Then, multiplying x by another number often produces a mathematical result that is not exactly representable, so it must be rounded to the floating-point format.
Neither the C nor the C++ standard dictates a particular algorithm for such things, so it is impossible to say in general how printf does this.
If you want to know an example of a printf implementation, you can have a look here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c and here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/printf_fp.c

controlling overflow and loss in precision while multiplying doubles

ques:
I have a large number of floating point numbers (~10,000 numbers), each having 6 digits after the decimal point. Now, the product of all these numbers would have about 60,000 digits, but a double only holds about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield the correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching subnormal territory (and in that case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 − 1, where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
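A minimal sketch of the wider-intermediate-type approach, assuming long double is actually wider than double on your target (true for the 80-bit extended format on most x86 toolchains, but not everywhere; otherwise a double-double or MPFR type would take its place):

    #include <cstdio>
    #include <vector>

    double product(const std::vector<double>& xs) {
        long double acc = 1.0L;           // wider intermediate precision
        for (double x : xs) acc *= x;     // each step rounds at extended precision
        return (double)acc;               // one final rounding back to double
    }

    int main() {
        std::vector<double> xs(10000, 1.000001);
        std::printf("%.17g\n", product(xs));
    }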

IEEE-754 floating point: Divide first or multiply first for best precision?

What's better if I want to preserve as much precision as possible in a calculation with IEEE-754 floating point values:
a = b * c / d
or
a = b / d * c
Is there a difference? If there is, does it depend on the magnitudes of the input values? And, if magnitude matters, how is the best ordering determined when general magnitudes of the values are known?
It depends on the magnitude of the values. Obviously if one divides by zero, all bets are off, but if a multiplication or division results in a denormal, subsequent operations can lose precision.
You may find it useful to study Goldberg's seminal paper What Every Computer Scientist Should Know About Floating-Point Arithmetic which will explain things far better than any answer you're likely to receive here. (Goldberg was one of the original authors of IEEE-754.)
Assuming that none of the operations would yield an overflow or an underflow, and your input values have uniformly distributed significands, then this is equivalent. Well, I suppose that to have a rigorous proof, one should do an exhaustive test (probably not possible in practice for double precision since there are 2^156 inputs), but if there is a difference in the average error, then it is tiny. I could try in low precisions with Sipe.
In any case, in the absence of overflow/underflow, only the exact values of the significands matter, not the exponents.
However, if the result a is added to (or subtracted from) another expression and not reused, then starting with the division may be more interesting, since you can group the multiplication with the following addition by using an FMA (thus with a single rounding).
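For example, with std::fma the product and the following addition are rounded only once; the values here are made up just to show the grouping:

    #include <cmath>
    #include <cstdio>

    int main() {
        double b = 1.0e16, c = 3.0, d = 7.0, e = 2.5;

        double two_roundings = b / d * c + e;          // rounds after '*' and again after '+'
        double one_rounding  = std::fma(b / d, c, e);  // computes (b/d)*c + e with a single rounding

        std::printf("%.17g\n%.17g\n", two_roundings, one_rounding);
    }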

cpp division - how to get most accurate outcome?

I want to divide two ull variables and get the most accurate outcome.
what is the best way to do that?
i.e. 5000034 / 5000000 = 1.0000068
If you want "most accurate precision" - you should avoid floating point arithmetics.
You might want to use some big decimal library [whcih usually implements fixed point arithmetic], and will allow you to define the precision you are seeking.
You should avoid floating point arithmetic because thet are not exact [you have finite number of bits to represent infinite number of numbers in every range, so some slicing must occure...]. Fixed point arithmetic [as usually implemented in big decimal libraries] allows you to allocate more bits "on the fly" to represent the number in the desired accuracy.
More info on the floating point issue can be found in this [a bit advanced] article: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Instead of (double)(N) / D, do 1 + ( (double)(N - D) / D)
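A sketch of that reformulation with the question's numbers. For values this small both forms give the same quotient; the point of subtracting first is that N - D is computed exactly in integer arithmetic, which pays off when you need the small excess over 1 accurately rather than recovering it by cancellation:

    #include <cstdio>

    int main() {
        unsigned long long N = 5000034, D = 5000000;

        double direct  = (double)N / D;               // ~1.0000068
        double shifted = 1 + ((double)(N - D) / D);   // same value, built from the exact difference

        double excess_direct  = direct - 1.0;         // fractional part recovered by cancellation
        double excess_shifted = (double)(N - D) / D;  // fractional part computed directly

        std::printf("%.17g  %.17g\n", direct, shifted);
        std::printf("%.17g  %.17g\n", excess_direct, excess_shifted);
    }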
I'm afraid that “the most accurate outcome” doesn't mean much. No finite representation can represent all real numbers exactly; how precise the representation can be depends on the size of the type and its internal representation. On most implementations, double will give about 17 decimal digits of precision, which is usually several orders of magnitude more precise than the input; for a single multiplication or division, double is usually fine. (Problems occur with addition and subtraction when the difference between the two values is extreme.) There exist packages which offer larger precision (BigDecimal, BigFloat and the like), but they are never exact: in the end, the precision is limited by the amount of memory you're willing to let them use. They're also much slower than double, and generally (slightly) more difficult to use correctly (since they have more options, e.g. just how much precision do you want). The only real answer to your question is another question: how much precision do you need? And for what sequence of operations? Rounding errors accumulate, so while double may be largely sufficient for a single division, it may cause problems if used naïvely for iterative procedures. Although in such cases, the solution isn't usually to increase the precision, but to change the algorithm in a way that avoids the problems. If double gives you the precision you need, use it in preference to any extended type. If it doesn't, and you don't have a choice, then choose one of the existing arbitrary precision libraries, such as GMP.
(You might also have an issue with the way rounding is handled. For bookkeeping purposes, for example, most jurisdictions have very strict laws concerning how to round monetary values, and their rules are based on decimal arithmetic. In such cases, you'll need a numeric type which does decimal arithmetic in order for the rounding to conform in all cases.)
Floating point numbers are probably most accurate for multiplication and division, while integers and fixed point numbers are the best choice for addition and subtraction. This follows from the fact that multiplication and division change the order of magnitude, which floating point numbers handle better, while addition and subtraction proceed in uniform steps, which integers and fixed point numbers handle better.
If you want the best accuracy when dividing integers, implement a RationalNumber class containing the numerator and denominator. This way your result will always be exact if you avoid arithmetic overflow. This requires that you accept output in fractional form.
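A minimal sketch of such a RationalNumber class (a hypothetical type, not from any library), reduced by the gcd so the exact quotient stays as small as possible:

    #include <cstdint>
    #include <cstdio>
    #include <numeric>   // std::gcd, C++17

    struct RationalNumber {
        int64_t num, den;
        RationalNumber(int64_t n, int64_t d) : num(n), den(d) {
            int64_t g = std::gcd(n, d);
            if (g != 0) { num /= g; den /= g; }   // keep the fraction reduced
        }
    };

    int main() {
        RationalNumber q(5000034, 5000000);   // the question's example, exactly
        std::printf("%lld/%lld\n", (long long)q.num, (long long)q.den);   // 2500017/2500000
    }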

Floating point versus fixed point: what are the pros/cons?

Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in an exponent sense, but not if one wants to store numbers with more precision for a certain range, for example only using integers from -16 to 16, thus using more bits to hold digits past the radix.
In terms of performance, which one is faster, or are there cases where one is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed type?
Why isn't there any fixed point type in C/C++?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64-bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64-bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32-bit mantissa, scaled by 2^4 (now there is no fractional part)
32-bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically a 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with more bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed-point, as in the sketch below.
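For instance, a Q8.8 multiply (8 integer bits, 8 fractional bits) might look like this: the raw product carries the scale factor twice, so it is formed in 32 bits and shifted back down before being stored in 16 bits.

    #include <cstdint>
    #include <cstdio>

    int16_t q8_8_mul(int16_t a, int16_t b) {
        int32_t wide = (int32_t)a * b;    // 16x16 -> 32-bit intermediate
        return (int16_t)(wide >> 8);      // renormalize (a real DSP routine would round/saturate)
    }

    int main() {
        int16_t a = (int16_t)(2.5 * 256); // 2.5 in Q8.8
        int16_t b = (int16_t)(1.5 * 256); // 1.5 in Q8.8
        std::printf("%f\n", q8_8_mul(a, b) / 256.0);   // 3.75
    }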
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic are faster than the FPU. When vectorization is used, the efficiency of the vectorization matters much more, and the performance difference between fixed point and floating point is moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions by using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages, the need for fixed point is somewhat domain-specific, and also perhaps because on a general-purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Integer math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address-calculation instructions to optimize an add/multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed point math. You just define a struct with one int and provide the appropriate operator overloads, making them do what they normally do plus a shift to put the comma back in the right position.
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400: project the ray from the eye to the point on the object in the world and compute where it pierces the screen using the 0-to-1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy size, whatever is left over after the exponent and sign bits, e.g. 23 bits, not 16 or 32 or 64 bits.
Floating point math at its core uses fixed point logic with extra logic and extra steps required. By definition, comparing apples to apples, fixed point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. With both float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks, as well as parallel operation.
One issue not covered in the answers is power consumption. Though it highly depends on the specific hardware architecture, usually the FPU consumes much more energy than the ALU in a CPU, so if you target mobile applications where power consumption is important, it's worth considering a fixed-point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the decimal point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand - it uses a standard-form (scientific-notation-like) representation of the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired and optimised for floating point operations. Furthermore, most modern CPUs come with instruction-set extensions such as SIMD which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8-bit and 16-bit) you may not have an FPU nor extended instruction sets. In that case you may be forced to use fixed point methods or limited floating point support that is not very fast. So in these circumstances fixed point will be a better - or even your only - choice.