I have a situation which requires a float to be represented in a single char. The range that this 'minifloat' needs to represent is 0 to 10e-7, so we can always assume that the number is +ve, and the exponent -ve in order to save space.
The representation that I have thought about going with is 3 bits of exponent and 5 bits of mantissa (plus 1 implied bit), with the exponent being in base 10, i.e. x = man * 10^exp.
To convert from a float to my minifloat, I plan to use frexp, and use some maths to convert from base 2 to base 10.
Is this a sensible approach? Or are there better ways to achieve this?
Do you actually need the value to be floating point (i.e. to have roughly constant precision as the value scales)? What are you going to do with these values?
A much simpler (and more efficient) idea would be to interpret 8 bits as an unsigned fixed-point number with an implicit scale of 1e-7. I.e.:
#include <stdint.h>

float toFloat(uint8_t x) {
    return x / 255.0e7;
}

uint8_t fromFloat(float x) {
    if (x < 0) return 0;
    if (x > 1e-7) return 255;
    return 255.0e7 * x; // this truncates; add 0.5 to round instead
}
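If it helps, here is a quick round-trip check using the two helpers above (the test value is arbitrary):

#include <cstdint>
#include <cstdio>

// Round-trip check for toFloat/fromFloat above. fromFloat truncates as
// written, so the restored value can come back slightly low.
int main() {
    float original = 0.42e-7f;             // somewhere inside [0, 1e-7]
    uint8_t packed = fromFloat(original);  // 0.42e-7 * 255e7 = 107.1 -> 107
    float restored = toFloat(packed);      // 107 / 255e7 ~= 4.196e-8
    std::printf("%d -> %g\n", packed, restored);
    return 0;
}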
If it serves your purposes, it is reasonable to use such a format as a storage or transmission format, that is, for recording data in a small space. You should verify that the rounding errors from this format are not too large for your needs, that the range is suitable, et cetera.
This would not be a good format for calculation, because it would be slow on normal hardware.
I do not understand what base conversion you would be doing. If you have an IEEE-754 floating-point number in a float, then the job of converting to or from your 8-bit format is one of rounding the significand (the fraction) when going to the narrower format and of adjusting the exponent bias, plus handling special cases (denormals, overflow, NaNs). This would just involve binary arithmetic, not decimal.
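For concreteness, here is a minimal sketch (not code from this answer) of what such a narrowing conversion can look like for a hypothetical 8-bit binary format: 3 exponent bits with an arbitrarily chosen bias of 7, and 5 fraction bits with an implied leading 1. Zero is special-cased; denormals and NaNs are ignored.

#include <cmath>
#include <cstdint>

// Sketch only: pack a positive float into a hypothetical 8-bit binary
// minifloat (3-bit exponent, bias 7; 5-bit fraction, implied leading 1).
uint8_t packMini(float x) {
    if (x <= 0.0f) return 0;              // treat non-positive as zero
    int e;
    float m = std::frexp(x, &e);          // x = m * 2^e, m in [0.5, 1)
    m *= 2.0f; e -= 1;                    // renormalize so m is in [1, 2)
    int frac = (int)std::lround((m - 1.0f) * 32.0f);  // round to 5 bits
    if (frac == 32) { frac = 0; e += 1; } // rounding carried into the exponent
    int biased = e + 7;
    if (biased < 0) return 0;             // underflow: flush to zero
    if (biased > 7) return 0xFF;          // overflow: clamp to the maximum
    return (uint8_t)((biased << 5) | frac);
}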
As an aside, note that the proper term for the fraction portion of a floating-point number is “fraction” or “significand” (the term used in the IEEE-754 standard). A “mantissa” is the fractional portion of a logarithm.
An alternative is to use a static array of 256 float (or double) values that you choose according to your own criteria.
Then the conversion unsigned char -> float/double is trivial...
The conversion float/double -> unsigned char is a bit more involved (find the nearest float in the static array); it would cost about 8 comparisons with a naive binary search, but you may do better depending on how you chose the values in the static array.
Of course, operations would be performed with native float/double.
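A minimal sketch of this idea, assuming the table holds 256 ascending values of your choosing (the equal-step fill below is only a placeholder):

#include <algorithm>
#include <cstdint>

// 256 representable values, sorted ascending. The equal-step fill is a
// placeholder; pick the spacing that suits your precision needs.
static float table[256];

static void initTable() {
    for (int i = 0; i < 256; i++)
        table[i] = i * (1e-7f / 255.0f);
}

float toFloat(uint8_t x) { return table[x]; }  // the trivial direction

uint8_t fromFloat(float x) {
    // Binary search for the nearest entry: about 8 comparisons.
    const float *hi = std::lower_bound(table, table + 256, x);
    if (hi == table) return 0;
    if (hi == table + 256) return 255;
    const float *lo = hi - 1;
    return (uint8_t)((x - *lo <= *hi - x ? lo : hi) - table);
}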
5 mantissa bits give you 32 different values; starting from 1.00 with a minimum step size of 0.25, that covers
1.00 1.25 1.50 1.75 2.00 .... 8.50 8.75
3 exponent bits give you 8 different values: 10^0 (which is 1), 10^-1, 10^-2, 10^-3, ... finally 10^-7.
Your fraction part's step size is 0.25. If your calculations can tolerate that error, then you can use this.
I have a problem with understanding fixed-point arithmetic and its implementation in C++. I was trying to understand this code:
#define scale 16

int DoubleToFixed(double num) {
    return num * ((double)(1 << scale));
}

double FixedToDouble(int num) {
    return (double)num / (double)(1 << scale);
}

int IntToFixed(int num) {
    return num << scale;
}
I am trying to understand exactly why we shift. I know that shifting to the left is basically multiplying a number by 2^x, where x is by how many positions we want to shift or scale, and shifting to the right is basically dividing by 2^x.
But why do we need to shift when we convert from int to fixed point?
A fixed-point format represents a number as an integer multiplied by a fixed scale. Commonly the scale is some base b raised to some power e, so the integer f would represent the number f * b^e.
In the code shown, the scale is 2^-16, or 1/65,536. (Calling the shift amount scale is a misnomer; 16, or rather -16, is the exponent.) So if the integer representing the number is 81,920, the value represented is 81,920 * 2^-16 = 1.25.
The routine DoubleToFixed converts a floating-point number to this fixed-point format by multiplying by the reciprocal of the scale; it multiplies by 65,536.
The routine FixedToDouble converts a number from this fixed-format to floating-point by multiplying by the scale or, equivalently, by dividing by its reciprocal; it divides by 65,536.
IntToFixed does the same thing as DoubleToFixed except for an int input.
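To make the worked example above concrete (1.25 stored as 81,920), here is a small self-contained demo restating the question's two conversion routines:

#include <cstdio>

#define scale 16

int DoubleToFixed(double num) { return (int)(num * (double)(1 << scale)); }
double FixedToDouble(int num) { return (double)num / (double)(1 << scale); }

int main() {
    int fixed = DoubleToFixed(1.25);     // 1.25 * 65536 = 81920
    double back = FixedToDouble(fixed);  // 81920 / 65536 = 1.25
    std::printf("%d -> %f\n", fixed, back);  // 81920 -> 1.250000
    return 0;
}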
Fixed-point arithmetic works on the concept of representing numbers as an integer multiple of a very small "base". Your case uses a base of 1/(1<<scale), aka 1/65536, which is approximately 0.00001525878.
So the number 3.141592653589793, could be represented as 205887.416146 units of 1/65536, and so would be stored in memory as the integer value 205887 (which is really 3.14158630371, due to the rounding during conversion).
The way to calculate this conversion of fractional-value-to-fixed-point is simply to divide the value by the base: 3.141592653589793 / (1/65536) = 205887.416146. (Notably, this reduces to 3.141592653589793 * 65536 = 205887.416146.) And since 65536 is a power of two, there is a shortcut: multiplying by a power of two is the same as left shifting by that many bits, so multiplying by 2^16, aka 65536, can be done by shifting left 16 bits. This is really fast, which is why most fixed-point calculations use an inverse power of two as their base.
Due to the inability to shift float values, your methods convert the base to a float and do floating-point multiplication, but other operations, such as fixed-point multiplication and division themselves, can take advantage of this shortcut.
Theoretically, one could manipulate the bits of a float directly to make the conversion functions faster than a plain floating-point multiplication, but most likely the compiler is already doing that under the covers.
It is also common for some code to use an inverse power of ten as its base, primarily for money, which usually uses a base of 0.01, but these cannot use a single shift as a shortcut and have to do slower math. One shortcut for multiplying by 100 is (value<<6) + (value<<5) + (value<<2) (this is effectively value*64 + value*32 + value*4, which is value*(64+32+4), which is value*100), but three shifts and three adds is sometimes faster than one multiplication. Compilers already do this shortcut under the covers if 100 is a compile-time constant, so in general, nobody writes code like this anymore.
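As a sketch of how the shift shortcut shows up in the arithmetic itself (illustrative code, not from the question): when two Q16.16 values are multiplied, the raw product carries a scale of 2^-32, so a single 16-bit shift restores the format.

#include <cstdint>

// Sketch: Q16.16 fixed-point multiply and divide. The 64-bit intermediate
// avoids overflow; the 16-bit shift restores the 2^-16 scale.
int32_t FixedMul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 16);  // (a*2^-16)*(b*2^-16) = a*b*2^-32
}

int32_t FixedDiv(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a << 16) / b);  // pre-shift to keep the scale
}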
I need a strong guarantee that int x = (int) std::round(y) will always give the correct results (y is finite and "humanly", e.g. -50000 to 50000).
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
To manage this, I reinvented the wheel with this ugly function:
#include <cmath>
#include <concepts>

template<std::integral S = int, std::floating_point T>
S roundi(T x)
{
    S r = (S)x;
    T r2 = std::fmod(x, 1);
    if (r2 >= 0.5) return r + 1;
    if (r2 <= -0.5) return r - 1;
    return r;
}
But is this necessary? Or does casting from double to int use the last mantissa bit for rounding?
Assuming int is 32 bits wide and double is 64 bits wide (and assuming IEEE 754), all values of int are exactly representable in a double.
That means std::round(4.1) returns exactly 4. Nothing more, nothing less. And casting that number to int is always exactly 4.
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
No, it cannot. The result of std::round is always an integer, exactly, with no rounding error.
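You can confirm this directly (illustrative only):

#include <cmath>
#include <cstdio>

int main() {
    double r = std::round(4.1);
    std::printf("%d %.17g\n", (int)r, r);  // prints: 4 4
    std::printf("%d\n", r == 4.0);         // prints: 1 (the result is exact)
    return 0;
}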
I need a strong guarantee that int x = (int) std::round(y) will always give the correct results (y is finite and "humanly", e.g. -50000 to 50000).
C++ inherits its floating-point model from C, and, per C 2018 5.2.4.2.2 12, double is capable of representing at least ten-digit integers, so [−50,000, +50,000] is well within its range. It is even within the range of float, which is capable of representing six-digit integers. This requirement extends back to C 1990.
Given an int A, is there a strong guarantee that A == (int)(double)A?
No, the C++ standard does not impose an upper limit on the width of int, nor a relationship between the precision of int (number of bits it uses for the value, excluding the sign bit) and the precision of double (number of bits or other digits in its significand), so a C++ implementation may have an int with more precision than double.
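To see the failure mode, long long (at least 64 bits, so wider than a double's 53-bit significand) can stand in for such a wide int on a typical platform:

#include <cstdio>

int main() {
    long long a = (1LL << 53) + 1;       // needs 54 bits of precision
    long long b = (long long)(double)a;  // double keeps only 53 bits
    std::printf("%lld %lld %d\n", a, b, a == b);  // a != b: last field is 0
    return 0;
}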
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
It's true that 4.1 itself is not stored exactly: it can be seen as 4.0 (which, being an integer, has an exact floating-point representation) plus 0.1, which is exactly 1/10. The problem you will actually have is if you try to round a number like that to one decimal place after the decimal mark (rounding to an integer multiple of 0.1 or 0.01 or 0.001, etc.).
If you were using decimal floating point (which C compilers normally don't provide) you would be lucky, as 0.1 is 10^(-1), which again has an exact representation in the machine. But as a binary floating-point number it has an infinite representation, 0.000110011001100110011001100...b, and depending on where you cut the number you will get one value or another; you will never get the exact value of the decimal number (with a finite number of digits).
But the way round() works is not that... it first adds 0.5 (which is exactly representable as a binary floating-point number) to the number, and then cuts off the fractional part (an exact operation), meaning that you always get an exact integer result (which is perfectly representable as a floating-point value, if the original number was in range). The rounding is equivalent to this pair of operations:
(int)(4.1 + 0.5);
so you will get the integer part of 4.6 after adding the 0.5 (or of something like 4.60000000000000003 or 4.59999999999999998; either way it truncates to 4.0, which is exactly representable in binary floating point), so you will never get a wrong answer when rounding to an integer. You could worry about values close to 4.5 (which might round to 4.0 instead of the correct 5.0), but 0.5 happens to be exactly 0.1b in binary, and so that case is not affected.
Beware, though, that rounding to multiples of a negative power of ten (0.1, 0.01, ...) is not guaranteed, as none of those numbers is representable exactly in binary floating point. All of them have an infinite representation as binary numbers, and because of the cut-off at some point, they are stored as a tiny amount above or below the true value (whichever is closer), and the rounding will not work.
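A quick way to see this is to print the stored values with more digits than the default (the exact digits shown depend on your library, but the deviation is always there):

#include <cstdio>

int main() {
    // Neither 0.1 nor 4.1 is exact in binary; extra digits reveal the
    // value that is actually stored.
    std::printf("%.20g\n", 0.1);  // 0.10000000000000000555...
    std::printf("%.20g\n", 4.1);  // 4.0999999999999996447...
    return 0;
}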
For the following program:
#include <iostream>
#include <iomanip>

using namespace std;

int main()
{
    for (float a = 1.0; a < 10; a++)
        cout << std::setprecision(30) << 1.0/a << endl;
    return 0;
}
I receive the following output:
1
0.5
0.333333333333333314829616256247
0.25
0.200000000000000011102230246252
0.166666666666666657414808128124
0.142857142857142849212692681249
0.125
0.111111111111111104943205418749
This is definitely not right for the lower-place digits, particularly with respect to 1/3, 1/5, 1/7, and 1/9; things just start going wrong around 10^-16. I would expect to see output more resembling:
1
0.5
0.333333333333333333333333333333
0.25
0.2
0.166666666666666666666666666666
0.142857142857142857142857142857
0.125
0.111111111111111111111111111111
Is this an inherent flaw in the float class? Is there a way to overcome this and have proper division? Is there a special datatype for doing precise decimal operations? Am I just doing something stupid or wrong in my example?
There are a lot of numbers that computers cannot represent, even if you use float or double-precision float. 1/3, or .3 repeating, is one of those numbers. So it just does the best it can, which is the result you get.
See http://floating-point-gui.de/, or google float precision, there's a ton of info out there (including many SO questions) on this subject.
To answer your questions -- yes, this is an inherent limitation in both the float class and the double class. Some mathematical programs (MathCAD, probably Mathematica) can do "symbolic" math, which allows calculation of the "correct" answers. In many cases, the round-off error can be managed, even over really complex computations, such that the top 6-8 decimal places are correct. However, the opposite is true as well -- naive computations can be constructed that return wildly incorrect answers.
For small problems like division of whole numbers, you'll get a decent number of decimal places of accuracy (maybe 4-6 places). If you use double-precision floats, that will go up to maybe 8. If you need more... well, I'd start questioning why you want that many decimal places.
First of all, since your code does 1.0/a, it gives you double (1.0 is a double value; 1.0f is float), as the rules of C++ (and C) always extend a smaller type to the larger one when the operands of an operation differ (so int + char makes the char into an int before adding the values, long + int makes the int long, etc.).
Second, floating-point values have a set number of bits for the significand: in float that is 23 bits (+1 'hidden' bit), and in double it's 52 bits (+1). Each decimal digit takes approximately 3.3 bits (exactly log2(10)), so a 24-bit significand gives approximately 7-8 digits and a 53-bit one approximately 15-17 digits. The remainder is just "noise" caused by the last few bits of the number not evening out when converting to a decimal number.
To have infinite precision, we would have to either store the value as a fraction, or have an infinite number of bits. And of course, we could have some other finite precision, such as 100 bits, but I'm sure you'd complain about that too, because it would just have another 15 or so digits before it "goes wrong".
Floats only have so much precision (24 significand bits, counting the hidden bit, to be precise). If you REALLY want to see "0.333333333333333333333333333333" output, you could create a custom "Fraction" class which stores the numerator and denominator separately. Then you could calculate the digit at any given point with complete accuracy.
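A minimal sketch of such a Fraction type (assumed names; exact as long as the long long components do not overflow, positive values only):

#include <cstdio>
#include <numeric>  // std::gcd (C++17)

// Sketch only: exact rational arithmetic for positive values, valid while
// the numerator and denominator fit in long long.
struct Fraction {
    long long num, den;

    Fraction(long long n, long long d) : num(n), den(d) { reduce(); }

    void reduce() {
        long long g = std::gcd(num, den);
        if (g) { num /= g; den /= g; }
    }

    Fraction operator/(const Fraction &o) const {
        return Fraction(num * o.den, den * o.num);
    }

    // Print the decimal expansion digit by digit, to any length.
    void print(int digits) const {
        long long n = num;
        std::printf("%lld.", n / den);
        n %= den;
        for (int i = 0; i < digits; i++) {
            n *= 10;
            std::printf("%lld", n / den);
            n %= den;
        }
        std::printf("\n");
    }
};

int main() {
    Fraction(1, 3).print(30);  // 0.333333333333333333333333333333
    return 0;
}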
I need to represent numbers using the following structure. The purpose of this structure is not to lose the precision.
struct PreciseNumber
{
    long significand;
    int exponent;
};
Using this structure, the actual double value can be represented as value = significand * 10^exponent.
Now I need to write a utility function which can convert a double into a PreciseNumber.
Can you please let me know how to extract the exponent and significand from the double?
The prelude is somewhat flawed.
Firstly, barring any restrictions on storage space, conversion from a double to a base 10 significand-exponent form won't alter the precision in any form. To understand that, consider the following: any binary terminating fraction (like the one that forms the mantissa on a typical IEEE-754 float) can be written as a sum of negative powers of two. Each negative power of two is a terminating fraction itself, and hence it follows that their sum must be terminating as well.
However, the converse isn't necessarily true. For instance, 0.3 base 10 is equivalent to the non-terminating 0.01 0011 0011 0011 ... in base 2. Fitting this into a fixed size mantissa would blow some precision out of it (which is why 0.3 is actually stored as something that translates back to 0.29999999999999999.)
By this, we may conclude that any precision that is intended by storing the numbers in decimal significand-exponent form is either already lost, or simply isn't gained at all.
Of course, you might think of the apparent loss of accuracy generated by storing a decimal number as a float as loss in precision, in which case the Decimal32 and Decimal64 floating point formats may be of some interest -- check out http://en.wikipedia.org/wiki/Decimal64_floating-point_format.
This is a very difficult problem. You might want to see how much code it takes to implement a double-to-string conversion (for printf, e.g.). You might steal the code from a GNU implementation, such as the GNU C library.
You cannot convert an "imprecise" double into a "precise" decimal number, because the required "precision" simply isn't there to begin with (otherwise why would you even want to convert?).
This is what happens if you try something like it in Java:
BigDecimal x = new BigDecimal(0.1);
System.out.println(x);
The output of the program is:
0.1000000000000000055511151231257827021181583404541015625
Well, you're at less precision than a typical double. Your significand is a long, which (at 32 bits) gives you a range from -2 billion to +2 billion: more than 9 but fewer than 10 digits of precision.
Here's an untested starting point on what you'd want to do for some simple math on PreciseNumbers
#include <math.h>

PreciseNumber Multiply(PreciseNumber lhs, PreciseNumber rhs)
{
    PreciseNumber ret;
    ret.significand = lhs.significand * rhs.significand;
    ret.exponent = lhs.exponent + rhs.exponent;  // exponents add
    return ret;
}

PreciseNumber Add(PreciseNumber lhs, PreciseNumber rhs)
{
    PreciseNumber ret;
    // Rescale rhs onto lhs's exponent before adding the significands.
    ret.significand = lhs.significand
                    + rhs.significand * pow(10, rhs.exponent - lhs.exponent);
    ret.exponent = lhs.exponent;
    return ret;
}
I didn't take care of any renormalization, but in both cases there are places where you have to worry about overflows, underflows, and loss of precision. Just because you're doing it yourself rather than letting the computer take care of it in a double doesn't mean the same pitfalls aren't there. The only way to not lose precision is to keep track of all of the digits.
Here's a very rough algorithm; a code sketch follows the steps.
Take the log10 of the number to get the exponent x, then divide the double by 10^x so that it is scaled into [1, 10).
Start with a significand of zero. Repeat the following 15 times, since a double contains 15 digits of significance:
Multiply the previous significand by 10.
Take the integer portion of the double, add it to the significand, and subtract it from the double.
Subtract 1 from the exponent.
Multiply the double by 10.
When finished, take the remaining double value and use it for rounding: if it's >= 5, add one to the significand.
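A sketch of that algorithm in code (it assumes a positive, finite input and uses long long for the significand so 15 digits fit; edge cases are left out):

#include <cmath>
#include <cstdio>

struct PreciseNumber { long long significand; int exponent; };

// Sketch of the digit-peeling algorithm above, positive finite input only.
PreciseNumber toPrecise(double value) {
    PreciseNumber r = {0, 0};
    if (value <= 0.0) return r;
    int exp10 = (int)std::floor(std::log10(value));
    double d = value / std::pow(10.0, exp10);  // scale into [1, 10)
    for (int i = 0; i < 15; i++) {             // ~15 significant digits
        int digit = (int)d;                    // integer part is one digit
        r.significand = r.significand * 10 + digit;
        d = (d - digit) * 10.0;
        exp10--;
    }
    if (d >= 5.0) r.significand++;             // round with the remainder
    r.exponent = exp10 + 1;
    return r;
}

int main() {
    PreciseNumber p = toPrecise(3.14159);
    std::printf("%lld e%d\n", p.significand, p.exponent);
    return 0;
}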
The IEEE 754 (64-bit) floating point is supposed to correctly represent 15 significant digits, although the internal representation can carry 17 digits. Is there a way to force the 16th and 17th digits to zero?
Ref: http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx :
[...] Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences: [...]
Example numbers:
d1 = 97842111437.390091
d2 = 97842111437.390076
d1 and d2 differ in the 16th and 17th decimal places, which are not supposed to be significant. I am looking for ways to force them to zero, i.e.
d1 = 97842111437.390000
d2 = 97842111437.390000
No. Counter-example: the two closest floating-point numbers to a rational
1.11111111111118
(which has 15 decimal digits) are
1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625
In other words, there is no floating-point number that starts with 1.1111111111111800.
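This is easy to check with std::nextafter (illustrative; how many exact digits your library prints may vary):

#include <cmath>
#include <cstdio>

int main() {
    double x = 1.11111111111118;
    // Print x and its immediate neighbours with enough digits to show
    // that none of them begins 1.1111111111111800...
    std::printf("%.55g\n", std::nextafter(x, 0.0));
    std::printf("%.55g\n", x);
    std::printf("%.55g\n", std::nextafter(x, 2.0));
    return 0;
}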
This question is a little malformed. The hardware stores the numbers in binary, not decimal, so in the general case you can't do precise math in base 10. Some decimal numbers (0.1 is one of them!) do not even have a non-repeating representation in binary. If you have precision requirements like this, where you care about the number being of known precision to exactly 15 decimal digits, you will need to pick another representation for your numbers.
No, but I wonder if this is relevant to any of your issues (GCC specific):
GCC Documentation
-ffloat-store: Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
You should be able to directly modify the bits in your number by creating a union with a field for the floating-point number and an integral type of the same size. Then you can access the bits you want and set them however you want. Here is an example where I whack the sign bit; you can choose any field you want, of course.
#include <stdio.h>

union double_int {
    double fp;
    unsigned long long integer;
};

int main(int argc, const char *argv[])
{
    union double_int my_union;
    my_union.fp = 1325.34634;

    /* print original numbers */
    printf("Float %f\n", my_union.fp);
    printf("Integer %llx\n", my_union.integer);

    /* whack the sign bit to 1 */
    my_union.integer |= 1ULL << 63;

    /* print modified numbers */
    printf("Negative float %f\n", my_union.fp);
    printf("Negative integer %llx\n", my_union.integer);
    return 0;
}
Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
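For example, to display only the leading 15 digits (either C or C++ style works):

#include <cstdio>
#include <iostream>
#include <iomanip>

int main() {
    double d = 97842111437.390091;
    std::printf("%.15g\n", d);                             // 97842111437.3901
    std::cout << std::setprecision(15) << d << std::endl;  // same idea
    return 0;
}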
If you're concerned about comparing numbers with ==, you really can't do that with floating-point numbers. Instead you want to see if the numbers are close enough (say, within some epsilon of each other).
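One common shape for that comparison (a sketch; the tolerance and the mixed absolute/relative scaling are choices you tune for your data):

#include <algorithm>
#include <cmath>

// Sketch: treat a and b as equal when they differ by at most a small
// tolerance, scaled up by the larger magnitude for big numbers.
bool nearlyEqual(double a, double b, double eps = 1e-9) {
    return std::fabs(a - b) <= eps * std::max({1.0, std::fabs(a), std::fabs(b)});
}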
Playing with the bits of the number directly isn't a great idea.