A floating-point type represents a number by storing its significant digits and its exponent in separate bit fields, so the whole value fits in 16, 32, 64 or 128 bits.
A fixed-point type stores numbers in two words: one representing the integer part, the other representing the part past the radix point as negative powers of two, 2^-1, 2^-2, 2^-3, etc.
Floats are better in that they have a wider range in an exponent sense, but fixed point wins if one wants to store numbers with more precision in a certain range, for example using only integers from -16 to 16 and thus spending more bits on digits past the radix point.
In terms of performance, which one is faster, or are there cases where each is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is negligible, or do they make their own fixed-point type?
Why isn't there any fixed-point type in C/C++?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically a 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient, so these APIs allow specifying coordinates (in texture space, color space, etc.) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed-point.
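As an illustration, here is a minimal sketch of such a widening multiply, assuming a hypothetical Q8.8 format (8 integer bits, 8 fractional bits):

    #include <cstdint>

    // Hypothetical Q8.8 fixed-point format: 8 integer bits, 8 fractional bits.
    using q8_8 = int16_t;

    // Multiply two Q8.8 values: widen to 32 bits to hold the Q16.16
    // intermediate, then shift back down to Q8.8.
    q8_8 fixed_mul(q8_8 a, q8_8 b) {
        int32_t wide = static_cast<int32_t>(a) * static_cast<int32_t>(b);
        return static_cast<q8_8>(wide >> 8);
    }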
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic are faster than the FPU. When vectorization is used, the efficiency of vectorization matters a lot more, such that the performance differences between fixed point and floating point are moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded systems, where often the target processor has no FPU and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed-point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages and the need for fixed point is somewhat domain specific, and also perhaps because on a general-purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
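To give a flavour of the approach, here is a minimal sketch of such a class, assuming a hypothetical Q16.16 format (a real library would add more operators, rounding control, and overflow handling):

    #include <cstdint>

    // Hypothetical Q16.16 fixed-point class: value = raw / 2^16.
    class Fixed {
        static constexpr int kFracBits = 16;
        int32_t raw_;
        Fixed(int32_t raw, bool) : raw_(raw) {}     // construct from raw bits
    public:
        explicit Fixed(double v)
            : raw_(static_cast<int32_t>(v * (1 << kFracBits))) {}
        Fixed operator+(Fixed o) const { return Fixed(raw_ + o.raw_, true); }
        Fixed operator-(Fixed o) const { return Fixed(raw_ - o.raw_, true); }
        Fixed operator*(Fixed o) const {
            // widen to 64 bits for the Q32.32 intermediate, then renormalize
            int64_t wide = static_cast<int64_t>(raw_) * o.raw_;
            return Fixed(static_cast<int32_t>(wide >> kFracBits), true);
        }
        double to_double() const { return raw_ / double(1 << kFracBits); }
    };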
You need to be careful when discussing "precision" in this context.
For the same number of bits of representation, the maximum fixed-point value has more significant bits than any floating-point value (because the floating-point format has to give some bits away to the exponent), but the minimum fixed-point value has fewer than any non-denormalized floating-point value (because the fixed-point value wastes most of its mantissa in leading zeros).
Also, depending on the way you divide the fixed-point number up, the floating-point value may be able to represent smaller numbers, meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating-point and integer math depends on the CPU you have in mind. On Intel chips the difference in clock ticks is not big. Integer math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address-calculation instructions to optimize an add/multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed-point math. You just define a struct with one int, overload the appropriate operators, and make them do what they normally do plus a shift to put the radix point back in the right position.
You don't use float in games because it is faster or slower; you use it because it is easier to implement algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc., makes things real easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400. Project the ray from the eye to the point on the object in the world and compute where it pierces the screen using 0-to-1 math, then multiply x by 800 and y by 400 and place that pixel.
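A minimal sketch of that last step (the 800x400 screen size and the names here are purely illustrative):

    // Map normalized [0,1) viewport coordinates onto a concrete screen
    // only at the final rasterization step.
    struct Pixel { int x, y; };

    Pixel to_pixel(float nx, float ny) {
        return { static_cast<int>(nx * 800), static_cast<int>(ny * 400) };
    }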
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy width: whatever is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating-point math at its core uses fixed-point logic, with extra logic and extra steps required. By definition, compared apples to apples, fixed-point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if an exception handler is enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. Both fixed and float clock rates can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it highly depends on the specific hardware architecture, usually an FPU consumes much more energy than the CPU's integer ALU, so if you target mobile applications where power consumption is important, it's worth considering a fixed-point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the radix point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand, since it uses a standard-form representation of the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired and optimised for floating-point operations. Furthermore, most modern CPUs come with extended instruction sets such as SIMD, exposed via intrinsics, which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with downsides.
On embedded systems and small microcontrollers (8-bit and 16-bit) you may not have an FPU or extended instruction sets, in which case you may be forced to use fixed-point methods or the limited floating-point instruction sets, which are not very fast. So in these circumstances fixed point will be a better, or even your only, choice.
Related
I just wanted to know how the CPU "casts" a floating point number.
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong? (I couldn't find the answer.) So, if this is the case and the floating-point numbers are not emulated, how does the compiler cast them?
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong?
On modern Intel processors, the compiler is likely to use the SSE/AVX registers. The FPU is often not in regular use.
I just wanted to know how the CPU "casts" a floating point number.
Converting an integer to a floating-point number is a computation that is basically (glossing over some details):
Start with the binary (for unsigned types) or two’s complement (for signed types) representation of the integer.
If the number is zero, return all bits zero.
If it is negative, remember that and negate the number to make it positive.
Locate the highest bit set in the integer.
Locate the lowest bit that will fit in the significand of the destination format. (For example, for the IEEE-754 binary32 format commonly used for float, 24 bits fit in the significand, so the 25th bit after the highest bit set does not fit.)
Round the number at that position where the significand will end.
Calculate the exponent, which is a function of where the highest bit set is. Add a “bias” used in encoding the exponent (127 for binary32, 1023 for binary64).
Assemble a sign bit, bits for the exponent, and bits for the significand (omitting the high bit, because it is always one). Return those bits.
That computation prepares the bits that represent a floating-point number. (It omits details involving special cases like NaNs, infinities, and subnormal numbers because these do not occur when converting typical integer formats to typical floating-point formats.)
That computation may be performed “in software” (that is, with general instructions for shifting bits, testing values, and so on) or “in hardware” (that is, with special instructions for doing the conversion). All desktop computers have instructions for this. Small processors for special-purpose embedded use might not have such instructions.
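As a rough illustration, here is a software sketch of those steps for converting a 32-bit int to IEEE-754 binary32 bits; to keep the example short it truncates at the significand boundary instead of rounding to nearest:

    #include <cstdint>

    uint32_t int32_to_float_bits(int32_t n) {
        if (n == 0) return 0;                       // zero: all bits zero
        uint32_t sign = 0;
        uint32_t mag = static_cast<uint32_t>(n);
        if (n < 0) { sign = 0x80000000u; mag = 0u - mag; }  // remember sign, negate

        int hi = 31;
        while (!(mag & (1u << hi))) --hi;           // locate highest bit set

        uint32_t exponent = static_cast<uint32_t>(hi + 127);  // add the bias

        // Align so the leading 1 sits at bit 23, dropping bits that do not
        // fit in the 24-bit significand (truncation, not proper rounding).
        uint32_t frac = (hi > 23) ? (mag >> (hi - 23)) : (mag << (23 - hi));
        frac &= 0x007FFFFFu;                        // omit the implicit leading 1

        return sign | (exponent << 23) | frac;      // assemble the bits
    }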
It is not clear what you mean by
"casting" a floating point number.
If the target architecture has an FPU, then the compiler will issue FPU instructions to manipulate floating-point variables; no mystery there.
To assign a float variable to an int variable, the float must be truncated or rounded (up or down). Special instructions usually exist to serve this purpose.
If the target architecture is "FPU-less", then the compiler (toolchain) might provide a software implementation of floating-point operations using the CPU instructions available. For example, an expression like a = x * y; will be equivalent to a = fmul(x, y); where fmul() is a compiler-provided special function (intrinsic) that does the floating-point operation without an FPU. Of course this is typically MUCH slower than using a hardware FPU. Floating-point arithmetic is not used on such platforms if performance matters; fixed-point arithmetic (https://en.wikipedia.org/wiki/Fixed-point_arithmetic) can be used instead.
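For instance, with GCC targeting a soft-float machine, an ordinary multiply is lowered to a libgcc helper call (the routine name in the comment is the one libgcc uses; other toolchains have their own equivalents):

    // Compiled for an FPU-less target, this multiply becomes a library call,
    // roughly: return __mulsf3(x, y);
    // (__mulsf3 is libgcc's soft-float single-precision multiply.)
    float scale(float x, float y) {
        return x * y;
    }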
I'm writing a binary file reader/writer and have decided that to handle the issue of endianness I will convert all data to "network" (big) endianness on writing and to host endianness on reading. I'm avoiding hton* because I don't want to link with winsock for just those functions.
My main point of confusion comes from how to handle floating-point values. For all integral values I have the sized types in <cstdint> (uint32_t, etc.), but from my research no such equivalent exists for floating-point types. I'd like to convert all floating-point values to a 32-bit representation on writing and convert back to whatever precision is used on the host (32 bits is enough for my application). This way I will know precisely how many bytes to write and read for floating-point values, as opposed to using sizeof(float) when sizeof(float) might be different on the machine loading the file than on the machine that wrote it.
I was just made aware of the possibility of using frexp to get the mantissa and exponent in integer terms, writing those integers out (with some fixed size), then reading the integers in and reconstructing the floating point value using ldexp. This looks promising, but I am wondering if there is any generally accepted or recommended method for handling float endianness without htonf/ntohf.
I know that almost certainly any platform I'll be targeting anytime soon will have float represented with 32 bits, but I'd like to make the code I write now as compatible as I can for use in future projects.
If you want to be completely cross-platform and standards-compliant, then the frexp/ldexp solution is the best way to go. (Although you might need to consider the highly theoretical case where either the source or the target hardware uses decimal floating point.)
Suppose that one or the other machine did not have a 32-bit floating point representation. Then there is no datatype on that machine bit-compatible with a 32-bit floating-point number, regardless of endianness. So there is then no standard way of converting the non-32-bit float to a transmittable 32-bit representation, or of converting the transmitted 32-bit representation to a native non-32-bit floating-point number.
You could restrict your scope to machines which have a 32-bit floating point representation, but then you will need to assume that both machines have the same number and order of bits dedicated to sign, exponent and mantissa. That's likely to be the case, since IEEE-754 format is almost universal these days, but C++ does not insist on it and it is at least conceivable that there is a machine which implements 1/8/23-bit floating point numbers with the sign bit at the low-order end instead of the high-order end.
In short, endianness is only one of the possible incompatibilities between binary floating point formats. Reducing every floating point number to two integers, however, avoids having to deal with other incompatibilities (other than radix).
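A minimal sketch of that two-integer round trip (the 24-bit scale below is an arbitrary choice for illustration; a real serializer would also pick fixed on-disk widths and byte orders for the two integers):

    #include <cmath>
    #include <cstdint>

    // Encode: value = frac * 2^exp with frac in [0.5, 1) (or 0); scale frac
    // to an integer so both halves can be written as fixed-size integers.
    void encode(float value, int32_t& mantissa, int32_t& exponent) {
        int exp = 0;
        float frac = std::frexp(value, &exp);
        mantissa = static_cast<int32_t>(frac * (1 << 24));  // 24 fraction bits
        exponent = exp;
    }

    // Decode: undo the scaling and reapply the exponent.
    float decode(int32_t mantissa, int32_t exponent) {
        return std::ldexp(mantissa / static_cast<float>(1 << 24), exponent);
    }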
Firstly, the problem I'm trying to solve is coming up with a better representation for values that will always remain uniformly distributed in the range:
0.0 <= x < 1.0
The motivation for this is to attempt to reduce the number of bytes used to store this data (the application is heavily memory and I/O bandwidth bound). Currently a 32-bit floating-point representation is used; 16-bit floating-point is proving insufficiently accurate.
My initial thoughts are to try and store the data in a 16-bit integer and to simply use the scheme:
x/(2^16 - 1) [x is an unsigned short]
To keep the algorithms largely the same and to retain use of the same floating-point hardware operations (at least at first), I would ideally like to keep converting this fractional representation into floating-point representation, performing the operation(s), then converting back into fractional representation for storage.
Clearly, there will be a loss of precision going back and forth between these two quite different, imprecise representations, but for our application, I suspect this might be an acceptable tradeoff.
I've done some research looking at what is currently out there that might give us a good starting point. The seminal "What Every Computer Scientist Should Know About Floating-Point Arithmetic" article (http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) led me to look at a few others, "Beyond Floating Point" (home.ccil.org/~cowan/temp/p319-clenshaw.pdf) being one such example.
Can anyone point me to other examples of representations that people have used elsewhere that might satisfy these requirements?
I'm concerned that any potential gain in exactness of representation (we're currently wasting much of the floating-point format by using this specific range) will be completely outweighed by the requirement to round twice, going from fractional representation to floating-point and back again. In which case, it may be necessary to do arithmetic directly in this fractional representation to get any benefit out of this approach. Any advice on this point would be helpful.
Don't use 2^16-1. Use 2^16. Yes, you will have very slightly less precision and waste your 0xFFFF, but you will guarantee that there is no loss of precision when converting to floating point. (In contrast, when converting away from floating point, you will lose 8 bits of mantissa precision.)
Round-trip conversions between precisions can cause problems with certain operations, in particular progressively summing numbers. If at all possible, treat your fixed-point values as "dirty", and don't use them for further floating-point computations; prefer recalculating from inputs to using intermediate results which are in fixed-point form.
Alternatively, use 24 bits. With this representation, you will lose no precision in either direction as long as your values don't underflow (that is, as long as they're above 2^-24).
Wouldn't 1/x be badly distributed in your range? 1/2, 1/3, 1/4, ... Do you not want to represent numbers above 1/2?
This kind of thing is done in NetCDF quite a lot to encode data for saving space.
    const double scale = 1.0 / 65536;
    unsigned short x;   // any value stored in x really represents x * scale
See example in NetCDF for a more general approach using scale and offset: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/tutorial/NetcdfDataset.html
Have a look at "Packed Data Values" section of this page:
https://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#Packed%20Data%20Values
There are two ways of implementing an AP fractional number: one is to emulate the storage and behavior of the double data type, only with more bytes, and the other is to use an existing integer APA implementation to represent a fractional number as a rational, i.e. as a pair of integers, numerator and denominator. Which of the two ways is more likely to deliver efficient arithmetic in terms of performance? (Memory usage is really of minor concern.)
I'm aware of the existing C/C++ libraries, some of which offer fractional APA with "floats" and others with rationals (none of them features fixed-point APA, however), and of course I could benchmark a library that relies on a "float" implementation against one that makes use of a rational implementation, but the results would largely depend on the implementation details of the particular libraries I would have to choose more or less at random from the nearly ten available ones. So it's more the theoretical pros and cons of the two approaches that I'm interested in (or three, if fixed-point APA is taken into consideration).
The question is what you mean by arbitrary precision that you mention in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?
In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).
Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid on integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data by, say, 2^64, and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as far as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
Note also that in many practical applications multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that often it is unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to efficiency of fixed-point arithmetic.
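A small sketch of that compensation effect (the Q32.32 format is chosen arbitrarily; __int128 is a common compiler extension, standing in for the big-integer type):

    #include <cstdint>

    using q32_32 = int64_t;   // fixed point: value = raw / 2^32

    // x = y * a / b with integer a and b: the 2^32 scale factor of y carries
    // straight through the multiply and divide, so no renormalization shift
    // is needed (the 128-bit intermediate guards against overflow).
    q32_32 mul_div(q32_32 y, int64_t a, int64_t b) {
        return static_cast<q32_32>(static_cast<__int128>(y) * a / b);
    }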
Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).
So, again, why are you considering floating-point and rational representations only? Is there something that prevents you from considering a fixed-point representation?
Since no one else seemed to mention this, rationals and floats represent different sets of numbers. The value 1/3 can be represented precisely with a rational, but not with a float. Even an arbitrary-precision float would take infinitely many mantissa bits to represent a number like 1/3, whose binary expansion repeats forever. This is because a float is effectively like a rational but where the denominator is constrained to be a power of 2. An arbitrary-precision rational can represent everything that an arbitrary-precision float can and more, because the denominator can be any integer instead of just powers of 2. (That is, unless I've horribly misunderstood how arbitrary-precision floats are implemented.)
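A quick illustration of the difference:

    #include <cstdio>

    int main() {
        long num = 1, den = 3;        // a rational stores 1/3 exactly
        double approx = 1.0 / 3.0;    // rounded to the nearest 53-bit binary float
        printf("%ld/%ld\n", num, den);
        printf("%.20f\n", approx);    // prints 0.33333333333333331483, not 1/3
        return 0;
    }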
This is in response to your prompt for theoretical pros and cons.
I know you didn't ask about memory usage, but here's a theoretical comparison in case anyone else is interested. Rationals, as mentioned above, specialize in numbers that can be represented simply in fractional notation, like 1/3 or 492113/203233, and floats specialize in numbers that are simple to represent in scientific notation with powers of 2, like 5*2^45 or 91537*2^203233. The amount of ASCII typing needed to represent the numbers in their respective human-readable forms is proportional to their memory usage.
Please correct me in the comments if I've gotten any of this wrong.
Either way, you'll need multiplication of arbitrary-size integers. This will be the dominant factor in your performance, since its complexity is worse than O(n*log(n)). Things like aligning operands and adding or subtracting large integers are O(n), so we'll neglect those.
For simple addition and subtraction, you need no multiplications for floats* and three multiplications for rationals, since a/b + c/d = (a*d + c*b)/(b*d). Floats win hands down.
For multiplication, you need one multiplication for floats and two multiplications for rationals, since (a/b) * (c/d) = (a*c)/(b*d). Floats have the edge.
Division is a little bit more complex, and rationals might win out here, but it's by no means a certainty. I'd say it's a draw.
So overall, IMHO, the fact that addition is at least O(n*log(n)) for rationals and O(n) for floats clearly gives the win to a floating-point representation.
*It is possible that you might need one multiplication to perform addition if your exponent base and your digit base are different. Otherwise, if you use a power of 2 as your base, then aligning the operands takes a bit shift. If you don't use a power of two, then you may also have to do a multiplication by a single digit, which is also an O(n) operation.
You are effectively asking the question: "I need to participate in a race with my chosen animal. Should I choose a turtle or a snail ?".
The first proposal "emulating double" sounds like staggered precision: using an array of doubles of which the sum is the defined number. There is a paper from Douglas M. Priest "Algorithms for Arbitrary Precision Floating Point Arithmetic" which describes how to implement this arithmetic. I implemented this and my experience is very bad: The necessary overhead to make this run drops the performance 100-1000 times !
The other method of using fractionals has severe disadvantages, too: You need to implement gcd and kgv and unfortunately every prime in your numerator or denominator has a good chance to blow up your numbers and kill your performance.
So from my experience they are the worst choices one can made for performance.
I recommend the use of the MPFR library which is one of the fastest AP packages in C and C++.
Rational numbers don't give arbitrary precision, but rather the exact answer. They are, however, more expensive in terms of storage and certain operations with them become costly and some operations are not allowed at all, e.g. taking square roots, since they do not necessarily yield a rational answer.
Personally, I think in your case AP floats would be more appropriate.
I am trying to make equivalence tests on an algorithm written in C++ and in Matlab.
The algorithm contains some kind of a loop in time and runs more than 1000 times. It has arithmetic operations and some math functions.
I feed the initial inputs to both platforms by hand (like a = 1.767, b = 6.65, ...) and when I check the hexadecimal representations of those inputs they are the same. So no problem with the inputs. I get the outputs from C++ into MATLAB via a text file with 16 decimal digits (I use a "setprecision(32)" statement).
But here comes the problem: although after the 614th step of both codes all the results are exactly the same, at step 615 I get a difference of about 2.xxx..xxe-19, and after this step the error becomes larger and larger, until at the end of the runs it is about 5.xx..xxe-14.
0x3ff1 3e42 a211 6cca--->[C++ function]--->0x3ff4 7619 7005 5a42
0x3ff1 3e42 a211 6cca--->[MATLAB function]--->ans
ans - 0x3ff4 7619 7005 5a42
= 2.xxx..xxe-19
I searched for how MATLAB handles numbers and found really interesting things, like the "denormalized mantissa". While realmin is about 1e-308, by denormalizing the mantissa MATLAB can represent real numbers down to about 1e-324. Furthermore, MATLAB holds many more digits for "pi" or "exp(1)" than C++ does.
On the other hand, the MATLAB help says that whatever format it displays, MATLAB uses double precision internally.
So, I'd really appreciate it if someone could explain the exact reason for these differences. How can we make equivalence tests between MATLAB and C++?
There is one thing about floating-point numbers on x86 CPUs: internally, the floating-point unit uses registers that are 10 bytes, i.e. 80 bits. Furthermore, the CPU has a setting that tells whether floating-point calculations should be made with 32-bit (float), 64-bit (double) or 80-bit precision, less precision meaning faster floating-point operations. (The 32-bit mode used to be popular for video games, where speed takes priority over precision.)
From this I remember I tracked down a bug in a calculation library (DLL) that, given the same input, did not give the same result depending on whether it was started from a test C++ executable or from MATLAB. Furthermore, this happened only in Release mode, not in Debug!
The final conclusion was that MATLAB set the CPU floating-point precision to 80 bits, whereas our test executable did not (and left the default 64-bit precision). Furthermore, this calculation mismatch did not happen in Debug mode because all the variables were written to memory as 64-bit double variables and reloaded from there afterwards, nullifying the additional 16 bits. In Release mode, some variables were optimized out (not written to memory), and all calculations were done in floating-point registers only, on 80 bits, keeping the additional 16 bits of non-zero value.
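If you hit this, one possible mitigation (sketched below for MSVC on Windows using its _controlfp_s API; it only affects x87 code, and other toolchains have their own mechanisms) is to pin the precision-control field yourself:

    #include <float.h>

    // Force x87 intermediates to 53-bit significands (i.e. plain double), so
    // register-resident values match values spilled to 64-bit memory slots.
    void pin_double_precision() {
        unsigned int current = 0;
        _controlfp_s(&current, _PC_53, _MCW_PC);
    }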
Don't know if this helps, but maybe worth knowing.
A similar discussion occurred before; the conclusion was that IEEE 754 tolerates error in the last bit for transcendental functions (cos, sin, exp, etc.), so you can't expect exactly the same results between MATLAB and C (not even from the same C code compiled with different compilers).
I may be way off track here, and you may already have investigated this possibility, but it could be that there are differences between C++ and MATLAB in the way the mathematical library functions (the sin(), cos() and exp() that you mention) are implemented internally. Ultimately, some kind of functional approximation must be used to generate function values, and if there is some difference between these methods then presumably it could manifest itself as numerical rounding error over a large number of iterations.
This question basically covers what I am trying to suggest: How does C compute sin() and other math functions?