An Alternative to Floating-Point for Storing Simple Fractional Values - c++

Firstly, the problem I'm trying to solve is coming up with a better representation for values that will always remain uniformly distributed in the range:
0.0 <= x < 1.0
The motivation for this is to reduce the number of bytes used to store this data (the application is heavily memory- and I/O-bandwidth-bound). Currently a 32-bit floating-point representation is used; 16-bit floating-point is proving insufficiently accurate.
My initial thoughts are to try and store the data in a 16-bit integer and to simply use the scheme:
x/(2^16 - 1) [x is an unsigned short]
To keep the algorithms largely the same and to retain use of the same floating-point hardware operations (at least at first), I would ideally like to keep converting this fractional representation into floating-point representation, performing the operation(s), then converting back into fractional representation for storage.
Clearly, there will be a loss of precision going back and forth between these two quite different, imprecise representations, but for our application, I suspect this might be an acceptable tradeoff.
I've done some research looking at what is currently out there that might give us a good starting point. The seminal "What Every Computer Scientist Should Know About Floating-Point Arithmetic" article (http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) led me to look at a few others, "Beyond Floating Point" (home.ccil.org/~cowan/temp/p319-clenshaw.pdf) being one such example.
Can anyone point me to other examples of representations that people have used elsewhere that might satisfy these requirements?
I'm concerned that any potential gain in exactness of representation (we're currently wasting much of the floating-point format by using this specific range) will be completely outweighed by the requirement to round twice, going from fractional representation to floating-point and back again. In that case, it may be necessary to do arithmetic directly in this fractional representation to get any benefit out of the approach. Any advice on this point would be helpful.

Don't use 2^16 - 1. Use 2^16. Yes, you will have very slightly less precision and waste your 0xFFFF value, but you guarantee that there is no loss of precision when converting to floating point. (In contrast, when converting away from floating point, you will lose 8 bits of mantissa precision.)
Round-trip conversions between precisions can cause problems with certain operations, in particular progressively summing numbers. If at all possible, treat your fixed-point values as "dirty", and don't use them for further floating-point computations; prefer recalculating from inputs to using intermediate results which are in fixed-point form.
Alternatively, use 24 bits. With this representation, you will lose no precision in either direction as long as your values don't underflow (that is, as long as they're above 2^-24).
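A minimal sketch of that 1/2^16 scheme in C++ (the helper names are illustrative, not from the original post); the conversion back to float is exact because the divisor is a power of two and a 16-bit integer fits in a float's 24-bit mantissa:
#include <cstdint>
// Values are assumed to lie in [0.0, 1.0).
const float kScale = 65536.0f;              // 2^16
uint16_t pack(float x)                      // float -> 16-bit fraction
{
    return (uint16_t)(x * kScale);          // truncates; 1.0 itself would overflow, hence the half-open range
}
float unpack(uint16_t q)                    // 16-bit fraction -> float
{
    return q / kScale;                      // q/2^16 is exactly representable in a float
}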

Wouldn't 1/x be badly distributed in your range? 1/2, 1/3, 1/4, ... Do you not want to represent numbers above 1/2?
This kind of thing is done in NetCDF quite a lot to encode data to save space.
const double scale = 1.0/65536;
unsigned short x;
Any number stored in x really represents x*scale.
See example in NetCDF for a more general approach using scale and offset: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/tutorial/NetcdfDataset.html
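A short sketch of the general scale-and-offset packing described in the NetCDF documentation (the variable and function names here are made up): packed = (value - offset) / scale, unpacked = packed * scale + offset.
const double scale  = 1.0 / 65536;          // chosen so the data range maps onto 0..65535
const double offset = 0.0;                  // non-zero when the data does not start at 0
unsigned short encode(double value)     { return (unsigned short)((value - offset) / scale); }
double         decode(unsigned short p) { return p * scale + offset; }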

Have a look at the "Packed Data Values" section of this page:
https://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#Packed%20Data%20Values

Related

Z3 - Floating point arithmetic API function Z3_mk_fpa_to_ubv

I am playing around with Z3-4.6.0 C++ for the first time. Sorry for the noob questions.
My question has 2 parts.
If I have a floating-point number and I use the Z3_mk_fpa_to_ubv(...) function to create an unsigned bit-vector, how much precision is lost?
If the precision is not lost, can I use this new unsigned bit-vector as a regular bit-vector and apply all the operations defined for it, e.g. Z3_mk_bvadd(....)?
I know I can use Z3_mk_fpa_to_ieee_bv(....) for a graceful, IEEE-754-compliant conversion. Afterwards I can add, subtract, etc. the bit-vectors.
Just being curious.
Thank you very much.
I'm afraid you're misinterpreting the role of these functions. A good reference to keep open while working with SMTLib floats is: http://smtlib.cs.uiowa.edu/papers/BTRW15.pdf
mk_fpa_to_ubv
This function corresponds to the FPToUInt function in the cited paper, which converts a floating-point value to an unsigned bit-vector under a given rounding mode. (The NaN choice in that definition is misleading: it should be read as "undefined.")
Note that the precision loss can be huge here, depending on what the FP value is and the bit-width of the vector. Imagine converting a double-precision floating-point value to an 8-bit word: you're squashing values in the range ±2.23×10^−308 to ±1.80×10^308 into a mere 256 different values, so a large number of conversions will simply go through massive rounding.
You should think of this as "casting" in C like languages:
unsigned char c;
double f = 3.7;                 // some floating-point value
c = (unsigned char) f;          // the fraction (and any out-of-range part) is lost
This is the essence of converting from double precision to an unsigned byte, and it suffers major precision loss. If instead you convert to a really large bit-vector (say one with a thousand bits), the conversion will still lose precision per the rounding mode, though you will be able to cover all the integer values in the range precisely. So it really depends on what BV type you convert to and which rounding mode you choose.
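For intuition only, here is the same effect in plain C++ (not the Z3 API): the fractional part is simply discarded, and most of the double range cannot be expressed in the small target type at all.
#include <cstdio>
int main()
{
    double f = 200.9;                        // in range for an 8-bit unsigned type
    unsigned char c = (unsigned char) f;     // fractional part discarded: c == 200
    std::printf("%u\n", (unsigned) c);
    // Converting a value outside 0..255 this way is undefined behaviour in C++,
    // which mirrors how little of the double range an 8-bit target can represent.
    return 0;
}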
mk_fpa_to_ieee_bv
This function has nothing to do with "preserving" the value, so asking about "precision loss" here is irrelevant. What it does is give you the underlying bit-vector representation of the floating-point value, per the IEEE-754 spec. The Wikipedia article has a good discussion of this representation: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64
In particular, if you interpret the output of this function as a two's complement integer value, you'll get a completely irrelevant value that has nothing to do with the value of the floating-point number itself. (Also, this conversion is not unique since NaN has multiple corresponding bit-vector patterns.)
Summary
Long story short, conversions from floats to bit-vectors suffer precision loss not only from losing the "fractional" part to rounding, but also from the limited range, unless you pick a very large bit-vector size. The IEEE-754 representation conversion does not preserve value, so doing arithmetic on values converted via that function is more or less meaningless.

controlling overflow and loss in precision while multiplying doubles

ques:
I have a large number of floating-point numbers (~10,000 of them), each with 6 digits after the decimal point. Multiplying all of these numbers together would yield a result with about 60,000 digits, but a double only holds about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them back to decimal, but this appears cumbersome and may not yield the correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching into subnormal territory (and in that case, multiplying by a power of two would be recommended, to avoid loss of accuracy from the multiplication itself). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. To first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms (*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the underflow and overflow problems and leaves you with a relative accuracy on the end result such that, once rounded back to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so the true bound on the relative error is more like (1 + ε)^9999 − 1, where ε is the machine epsilon. Also, in practice relative errors often cancel each other out, so you can expect the actual relative error to grow like the square root of the theoretical maximum error.
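A minimal sketch of the suggestion, assuming the compiler maps long double to the 80-bit extended format (otherwise a software wide type such as MPFR or double-double would be needed):
#include <vector>
// Multiply many doubles using a wider type for the running product,
// then round once at the end.
double product(const std::vector<double>& xs)
{
    long double acc = 1.0L;      // 64-bit mantissa when long double is the x86 80-bit format
    for (double x : xs)
        acc *= x;                // intermediate rounding happens at long double precision
    return (double) acc;         // a single final rounding back to double
}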

cpp division - how to get most accurate outcome?

I want to divide two ull (unsigned long long) variables and get the most accurate outcome.
What is the best way to do that?
i.e. 5000034 / 5000000 = 1.0000068
If you want the "most accurate precision", you should avoid floating-point arithmetic.
You might want to use a big-decimal library [which usually implements fixed-point arithmetic] that will allow you to define the precision you are seeking.
You should avoid floating-point arithmetic because it is not exact [you have a finite number of bits to represent an infinite number of numbers in every range, so some slicing must occur...]. Fixed-point arithmetic [as usually implemented in big-decimal libraries] allows you to allocate more bits "on the fly" to represent the number to the desired accuracy.
More info on the floating point issue can be found in this [a bit advanced] article: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Instead of (double)(N) / D, do 1 + ( (double)(N - D) / D)
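Spelled out for the example above (illustrative only): the difference N - D is small and exactly representable, so the division only has to capture the tiny correction rather than a value very close to 1.
unsigned long long N = 5000034ULL, D = 5000000ULL;
double direct     = (double) N / (double) D;              // divide the raw values
double rearranged = 1.0 + ((double)(N - D) / (double) D); // 1 + 34/5000000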
I'm afraid that "the most accurate outcome" doesn't mean much. No finite representation can represent all real numbers exactly; how precise the representation can be depends on the size of the type and its internal representation. On most implementations, double will give about 17 decimal digits of precision, which is usually several orders of magnitude more precise than the input; for a single multiplication or division, double is usually fine. (Problems occur with addition and subtraction when the difference between the two values is extreme.) There exist packages which offer larger precision (BigDecimal, BigFloat and the like), but they are never exact: in the end, the precision is limited by the amount of memory you're willing to let them use. They're also much slower than double, and generally (slightly) more difficult to use correctly (since they have more options, e.g. just how much precision do you want). The only real answer to your question is another question: how much precision do you need? And for what sequence of operations? Rounding errors accumulate, so while double may be largely sufficient for a single division, it may cause problems if used naïvely for iterative procedures. Although in such cases, the solution usually isn't to increase the precision, but to change the algorithm in a way that avoids the problems. If double gives you the precision you need, use it in preference to any extended type. If it doesn't, and you don't have a choice, then choose one of the existing arbitrary-precision libraries, such as GMP.
(You might also have an issue with the way rounding is handled. For bookkeeping purposes, for example, most jurisdictions have very strict laws concerning how to round monetary values, and their rules are based on decimal arithmetic. In such cases, you'll need a numeric type which does decimal arithmetic in order for the rounding to conform in all cases.)
Floating-point numbers are probably most accurate for multiplication and division, while integers and fixed-point numbers are the better choice for addition and subtraction. This follows from the fact that multiplication and division change the order of magnitude, which floating-point numbers handle better, while addition and subtraction are a kind of linear step, which integers and fixed-point numbers handle better.
If you want the best accuracy when dividing integers, implement a RationalNumber class containing the numerator and denominator. That way your result will always be exact as long as you avoid arithmetic overflow. This requires that you accept output in fractional form.
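A bare-bones sketch of such a class (hypothetical names, C++17 for std::gcd, and no protection against overflow beyond reducing by the gcd):
#include <numeric>   // std::gcd
struct RationalNumber {
    long long num;
    long long den;
    RationalNumber(long long n, long long d) : num(n), den(d) { reduce(); }
    void reduce() {
        long long g = std::gcd(num, den);
        if (g != 0) { num /= g; den /= g; }
    }
    RationalNumber operator*(const RationalNumber& o) const {
        return RationalNumber(num * o.num, den * o.den);    // may overflow for large operands
    }
    RationalNumber operator/(const RationalNumber& o) const {
        return RationalNumber(num * o.den, den * o.num);
    }
};
// 5000034 / 5000000 stays exact as the reduced fraction 2500017/2500000.
RationalNumber q = RationalNumber(5000034, 1) / RationalNumber(5000000, 1);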

Is there a solution for Floating point Arithmetic problems in C++?

I am doing some floating-point arithmetic and having precision problems: the resulting value is different on two machines for the same input. I read the post "Why can't I multiply a float?" and other material on the web, and understood that it has to do with the binary representation of floating point and with the machine epsilon. However, I wanted to check whether there is a way to solve this problem or some workaround for floating-point arithmetic in C++. I am converting a float to an unsigned short for storage and converting it back when necessary. However, after the round trip the precision (to 6 decimal places) remains correct on one machine but fails on the other.
//convert FLOAT to short
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);
//Convert SHORT to FLOAT
float floatValue = ((float)shortValue / sConst);
A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.
Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.
I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one use an older floating-point implementation that coincidentally gave the results you were expecting on the samples you tried? Was it perhaps using a larger short than the other?
If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.
The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.
To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:
Floating point comparison functions for C#
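A hedged sketch of the kind of comparison that answer describes; the tolerances are application-specific placeholders, not canonical values:
#include <cmath>
#include <algorithm>
// True if a and b agree to within a relative tolerance,
// with an absolute tolerance as a fallback for values near zero.
bool nearlyEqual(double a, double b,
                 double relTol = 1e-9, double absTol = 1e-12)
{
    double diff = std::fabs(a - b);
    if (diff <= absTol) return true;
    return diff <= relTol * std::max(std::fabs(a), std::fabs(b));
}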
Are you looking for a standard like this:
Programming Languages C++: Technical Report Type 2 on extensions for the programming language C++ to support decimal floating-point arithmetic (draft)
Instead of using 0xFFFF, use half of it, i.e. 32768, for the conversion. 32768 (0x8000) has a binary representation of 1000000000000000, whereas 0xFFFF has a binary representation of 1111111111111111. Because 0x8000 is a power of two, the multiplication and division during conversion (to short, or back to float) only change the exponent, not the stored digits. For a one-way conversion, however, 0xFFFF is preferable, as it leads to a more accurate result.

Floating point versus fixed point: what are the pros/cons?

A floating-point type represents a number by storing its significant digits and its exponent separately, on separate binary words, so that it fits in 16, 32, 64 or 128 bits.
A fixed-point type stores numbers with two words, one representing the integer part, another representing the part past the radix point, in negative powers of two: 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in an exponent sense, but not if one wants to store numbers with more precision in a certain range, for example only using integers from -16 to 16 and thus using more bits to hold the digits past the radix point.
In terms of performance, which one has the better performance, or are there cases where one is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed-point type?
Why isn't there any fixed type in C/C++ ?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed point.
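For example, a Q15-style multiply might look like the sketch below (the format and the round-to-nearest choice are illustrative, and saturation is omitted):
#include <cstdint>
// Multiply two 16-bit fixed-point values with 15 fractional bits (Q15).
// The 32-bit intermediate holds the full product before renormalising.
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t) a * (int32_t) b;   // full-precision product
    wide += 1 << 14;                            // round to nearest instead of truncating
    return (int16_t)(wide >> 15);               // back to Q15 (no saturation here)
}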
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than the FPU. When vectorization is used, the efficiency of the vectorization matters a lot more, such that the performance difference between fixed point and floating point is moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only; some architectures provide no hardware implementation at all. In both cases, specialized math libraries may provide those functions using only integer or fixed-point arithmetic. Often, such libraries provide multiple levels of precision, for example answers which are only accurate up to N bits of precision (less than the full precision of the representation); the limited-precision versions may be faster than the highest-precision ones.
Fixed point is widely used in DSP and embedded systems, where often the target processor has no FPU and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously, if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported by the instruction set rather than implemented algorithmically.
There is no built-in fixed-point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages, the need for fixed point is somewhat domain-specific, and also perhaps because on a general-purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating-point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Integer math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use the special address-calculation instructions to optimize an add and multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed-point math. You just define a struct with one int and provide the appropriate operator overloads, making them do what they normally do plus a shift to put the radix point back in the right position.
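A minimal sketch of such a type with 16 fractional bits (the names and the small set of operators shown are illustrative):
#include <cstdint>
// Toy fixed-point type: the stored integer means raw / 2^16.
struct Fixed {
    int32_t raw;
    static Fixed fromDouble(double d) { Fixed f; f.raw = (int32_t)(d * 65536.0); return f; }
    double toDouble() const          { return raw / 65536.0; }
    Fixed operator+(Fixed o) const { Fixed r; r.raw = raw + o.raw; return r; }
    Fixed operator-(Fixed o) const { Fixed r; r.raw = raw - o.raw; return r; }
    Fixed operator*(Fixed o) const {
        Fixed r;
        r.raw = (int32_t)(((int64_t) raw * o.raw) >> 16);   // the shift puts the radix point back
        return r;
    }
};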
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc., makes things really easy to implement. Do all of the math that way, up to the point where you need to compute real pixels on a real screen size, say 800x400: project the ray from the eye to the point on the object in the world, compute where it pierces the screen using 0-to-1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy size: whatever is left over after the exponent and sign bits, e.g. 23 bits, not 16 or 32 or 64 bits.
Floating-point math at its core uses fixed-point logic, with extra logic and extra steps required. By definition, compared apples to apples, fixed-point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic and more clock cycles (properly signed infinity, quiet and signaling NaNs, different results for the same operation if an exception handler is enabled). As someone pointed out in a comment, on a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processor units and recover some clocks that way. Both float and fixed clock rates can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks, as well as parallel operation.
One issue not covered in the other answers is power consumption. Though it depends heavily on the specific hardware architecture, the FPU usually consumes much more energy than the ALU, so if you target mobile applications where power consumption is important, it's worth considering a fixed-point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision: you have to select the number of places after the radix point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand; it uses a standard-form (scientific-notation) representation of the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point: most systems come with built-in FPUs that are hardwired and optimised for floating-point operations. Furthermore, most modern CPUs provide SIMD instruction sets that help optimise vector-based methods via vectorisation and unrolling, so fixed point only comes with a downside.
On embedded systems and small microcontrollers (8-bit and 16-bit) you may have neither an FPU nor extended instruction sets, in which case you may be forced to use fixed-point methods or the limited floating-point instruction sets that are not very fast. In these circumstances, fixed point will be the better, or even your only, choice.