Negative decimal to fixed-point binary - twos-complement

Is -28.91 = 00100.0111?
28 -> 11100, then flip and add 1
-28 -> 00100
.91 -> 0111, to an accuracy of 4 places
I have tried checking my conversion in a lot of places, but I am failing at it, so I'd like to ask here whether I am correct.

For addition / subtraction and other operations to work normally (by using binary addition on the whole bit-pattern), the whole thing (integer and fractional parts combined), treated as an integer, has to be x * 2^4.
i.e. the actual value represented by 0b00100.0111 is 0b001000111 / 16.
That means you have to do 2's complement negation (binary subtraction from 0, or use the invert and add 1 identity) for the whole and fractional bits together.
Also, your value for 28 has its MSB set, so it's already negative, i.e. you've overflowed 5-bit signed 2's complement. Presumably you actually have a wider integer part.
For 16-bit 12.4 fixed-point, 28.91:
28.91 * 16 = 462.56, which rounds up to 463.
+463 = 0b0000000111001111
-463 = 0b1111111000110001
As 12.4 fixed-point, this 0b111111100011.0001 bit-pattern represents -463/16 = -28.9375, the nearest representable value to -28.91
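If you want to check a conversion like this programmatically, here is a minimal C++ sketch (the variable names are mine, not from any library) that produces the 16-bit 12.4 pattern for -28.91:

#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    const int frac_bits = 4;
    double value = -28.91;

    // Scale by 2^4, round to nearest, and store in a 16-bit two's-complement container.
    std::int16_t fixed = static_cast<std::int16_t>(std::lround(value * (1 << frac_bits)));

    unsigned raw = static_cast<std::uint16_t>(fixed);
    std::printf("raw bits = 0x%04X\n", raw);      // 0xFE31 = 0b1111111000110001
    std::printf("value    = %f\n", fixed / 16.0); // -28.937500
}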

Related

How to connect the theory of fixed-point numbers and its practical implementation?

The theory of fixed-point numbers is that we divide a certain number of bits between the integer part and the fractional part; this split is fixed.
For example, 26.5 (binary 11010.1) is stored with the integer bits first, followed by the fraction bits.
To convert from floating-point to fixed-point, we follow this algorithm:
Calculate x = floating_input * 2^(fractional_bits)
27.3 * 2^10 = 27955.2
Round x to the nearest whole number (e.g. round(x))
27955
Store the rounded x in an integer container
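A minimal sketch of those three steps in C++ (illustrative only, using the same 27.3 and 10 fractional bits as in the example):

#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    const int fractional_bits = 10;
    double floating_input = 27.3;

    double x = floating_input * std::pow(2.0, fractional_bits);      // step 1: 27955.2
    std::int32_t stored = static_cast<std::int32_t>(std::lround(x)); // steps 2 and 3: 27955

    std::printf("%d\n", stored); // 27955
}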
Now, if we look at the bit representation of our numbers and at what multiplying by 2^(fractional_bits) does, we see:
27 is 11011
27 * 2^10 is 110 1100 0000 0000, which is 27 shifted 10 bits to the left.
So we can say that multiplying by 2^10 gives us "space" in the low-order bits to hold the fractional part of the number. Two numbers converted in this way can be operated on together and eventually converted back to the familiar form with a point by the opposite step: dividing by 2^10.
If we recall that the bits live in some integer variable, which has its own fixed width, it becomes clear that the more bits of that variable are devoted to the fraction part, the fewer bits remain for the integer part of the number.
27.3 * 2^10 = 27955.2, which should be rounded for storage in an integer type to
27955, which is 110 1101 0011 0011.
After that, the number can be manipulated somehow (the exact operations aren't important now); let's say we want to retrieve a human-readable value back:
27955 / 2^10 = 27.2998046875
What about the number of bits after the point?
Let's say we have two numbers we want to multiply, and we choose 10 bits after the point:
27 * 3.3 = 89.1 expected
27*2^10 = 27 648 is 110 1100 0000 0000
3.3*2^10 = 3 379 is 1101 0011 0011
27 648 * 3 379 = 93 422 592
consequently
27 * 3.3 = 93 422 592 / (2^10 * 2^10) = 89.09, which is pretty accurate
Now let's take 1 bit after the point:
27 and 3.3
27*2^1 = 54 is 110110
3.3*2^1 = 6.6, which truncates to 6, i.e. 110
54 * 6 = 324
consequently
27*3.3 = 324/(2^1*2^1) = 81 which is unsatisfying
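To check both cases side by side, here is a small sketch (it deliberately truncates when converting, to match the numbers above; rounding to nearest would reduce the error a little):

#include <cstdio>
#include <initializer_list>

int main()
{
    // Compare 27 * 3.3 at 10 fractional bits and at 1 fractional bit.
    for (int frac_bits : {10, 1})
    {
        long a = static_cast<long>(27.0 * (1L << frac_bits)); // 27648 or 54
        long b = static_cast<long>(3.3  * (1L << frac_bits)); // 3379  or 6
        long product = a * b;                                 // scaled by 2^(2 * frac_bits)
        double result = static_cast<double>(product) / (1L << frac_bits) / (1L << frac_bits);
        std::printf("%2d fractional bits: %g\n", frac_bits, result); // 89.0984 and 81
    }
}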
In practice we can use code like the following to create and operate on fixed-point numbers:
#include <iostream>
using namespace std;

const int scale = 10;
#define DoubleToFixed(x) ((x) * (double)(1 << scale))
#define FixedToDouble(x) ((double)(x) / (double)(1 << scale))
#define IntToFixed(x)    ((x) << scale)
#define FixedToInt(x)    ((x) >> scale)
#define MUL(x, y)        (((x) * (y)) >> scale)
#define DIV(x, y)        (((x) << scale) / (y))

int main()
{
    double a = 7.27;
    double b = 3.0;

    int f = DoubleToFixed(a);
    cout << f << endl;                 // 7444
    cout << FixedToDouble(f) << endl;  // 7.26953125

    int g = DoubleToFixed(b);
    cout << g << endl;                 // 3072

    int c = MUL(f, g);
    cout << FixedToDouble(c) << endl;  // 21.80859375
}
So, where is the connection between the theory of a fixed placement of the point between bits (powers of 2) and the practical implementation? If we store a fixed-point number in an int, it is obvious that there is no place in it to store the point.
It seems that fixed-point numbers are just a conversion used to increase performance, and that to retrieve a human-readable number after the calculations, the opposite conversion must be applied.
I hope I understand the algorithm. But is the idea of placing the point between digits just an abstract idea?
Fixed-point formats are used as a way to represent fractional numbers. Quite commonly, processors perform fixed-point or integer arithmetic faster or more efficiently than floating-point arithmetic. Whether fixed-point arithmetic is suitable for an application depends on what numbers the application needs to work with.
Using fixed-point formats does require converting input to the fixed-point format and converting numbers in the fixed-point format to output. But this is also true of integers and floating-point: All input must be converted to whatever internal format is used to represent it, and all output must be produced by converting from internal formats.
And how does multiplying by 2^(fractional_bits) affect the number of digits after the point?
Suppose we have some number x that is represented as an integer X = x·2^f, where f is the number of fraction bits. Conceptually, X is in a fixed-point format. Similarly, we have y represented as Y = y·2^f.
If we execute an integer multiplication instruction to produce the result Z = XY, then Z = XY = (x·2^f)·(y·2^f) = xy·2^(2f). Then, if we divide Z by 2^f (or, nearly equivalently, shift it right by f bits), we have xy·2^f except for any rounding errors that may have occurred in the division. And xy·2^f is the fixed-point representation of the product of x and y.
Thus, we can effect a fixed-point multiplication by performing an integer multiplication followed by a shift.
Often, to get rounding instead of truncation, a value of half of 2^f is added before the shift, so we compute floor((XY + 2^(f−1)) / 2^f):
Multiply X by Y.
Add 2^(f−1).
Shift right f bits.
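A minimal sketch of this rounded multiply in C++, assuming f = 10 fractional bits and a 64-bit intermediate (the right shift of a negative intermediate relies on the usual arithmetic-shift behaviour of mainstream compilers):

#include <cstdint>
#include <cstdio>

// Rounded fixed-point multiply: floor((X*Y + 2^(f-1)) / 2^f),
// computed in a wider type so the intermediate product cannot overflow.
std::int32_t fixed_mul(std::int32_t X, std::int32_t Y, int f)
{
    std::int64_t Z = static_cast<std::int64_t>(X) * Y; // xy * 2^(2f)
    return static_cast<std::int32_t>((Z + (std::int64_t{1} << (f - 1))) >> f);
}

int main()
{
    const int f = 10;
    std::int32_t X = 7444;                            // ~7.27 as Q.10
    std::int32_t Y = 3072;                            //  3.0 as Q.10
    std::printf("%f\n", fixed_mul(X, Y, f) / 1024.0); // ~21.808594
}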
It seems that fixed-point numbers are just a conversion to increase performance.
You might as well say that floating-point numbers are a conversion to increase the representable range.
Whatever format your numbers are originally coming in as (strings, voltage levels, integers, etc.), you often convert them to floating point numbers in order to store or operate on them, but neither floating point nor fixed point is a human-readable representation.
Floating point numbers have lower precision and a wider magnitude range; fixed point numbers have higher precision and a narrower magnitude range. (Performance differences depend on the architecture and the important operations.) You shouldn't think of the fixed-point representation as a conversion from floating point, but as an alternative to floating point.
I think you want a class that wraps an int along with the fixed radix point information. Indeed, the use is implicit, but you then define your own multiplication (for example) that works on the fixed point meaning as a whole rather than just multiplying the underlying ints.
You don't want to leave the meaning implicit; make it known to the compiler in a strong way. You should not have to call your handling functions explicitly; make it part of the class semantics, as sketched below.
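As a rough illustration of what such a class might look like (a sketch, not a production fixed-point library; the names Fixed and SCALE are mine):

#include <cstdint>
#include <iostream>

// A tiny fixed-point wrapper: the position of the radix point is baked into
// the type, and operator* rescales internally, so callers never shift by hand.
class Fixed
{
    static const int SCALE = 10;  // fractional bits
    std::int32_t raw;             // stored value is (real value) * 2^SCALE

    struct RawTag {};
    Fixed(std::int32_t r, RawTag) : raw(r) {}

public:
    explicit Fixed(double d)
        : raw(static_cast<std::int32_t>(d * (1 << SCALE))) {}

    Fixed operator+(Fixed other) const { return Fixed(raw + other.raw, RawTag{}); }

    Fixed operator*(Fixed other) const
    {
        std::int64_t z = static_cast<std::int64_t>(raw) * other.raw; // scaled by 2^(2*SCALE)
        return Fixed(static_cast<std::int32_t>(z >> SCALE), RawTag{});
    }

    double toDouble() const { return static_cast<double>(raw) / (1 << SCALE); }
};

int main()
{
    Fixed a(7.27), b(3.0);
    std::cout << (a * b).toDouble() << std::endl; // ~21.8086
}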

fixed point subtraction for two's complement data

I have some real data, for example +2 and -3. These data are represented in two's complement fixed point as 4-bit binary values, where the MSB is the sign bit and the number of fractional bits is zero.
So +2 = 0010
-3 = 1101
The addition of these two numbers is (+2) + (-3) = -1:
(0010) + (1101) = (1111)
But in the case of subtraction, (+2) - (-3), what should I do?
Do I need to take the two's complement of 1101 (-3) again and add it to 0010?
You can evaluate -(-3) in binary and then simply add it to the other value.
With two's complement, evaluating the negation of a number is pretty simple: invert every bit and add 1. (When the least significant bit is 1, as here, this is the same as inverting every bit except that LSB.) This assumes the number is represented with n bits (n = 4 in your example).
In your example (with an informal notation): -(-3) = -(1101) = 0011
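For example, the whole operation can be checked with ordinary unsigned arithmetic, masking to 4 bits (a small sketch):

#include <cstdio>

int main()
{
    unsigned a = 0b0010;                 // +2
    unsigned b = 0b1101;                 // -3 in 4-bit two's complement

    unsigned neg_b = (~b + 1u) & 0xFu;   // invert all bits, add 1, keep 4 bits -> 0011 (+3)
    unsigned diff  = (a + neg_b) & 0xFu; // (+2) - (-3) done as (+2) + (+3) -> 0101 (+5)

    std::printf("%X\n", diff);           // prints 5
}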

Is it possible for "floor" to return an inaccurate result due to floating point rounding error?

I understand that floating point numbers can often include rounding errors.
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
Is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999?
The floor() function returns a floating-point value that is an exact integer. So the premise of your question is wrong to begin with.
Now, floor(x) returns the nearest integral value that is not greater than x. It is always true that
floor(x) <= x
and that there exists no integer i, greater than floor(x), such that i <= x.
Looking at floor(3.14159265), this returns 3.0. There's no debate about that. Nothing more to say.
Where it gets interesting is if you write floor(x) where x is the result of an arithmetic expression. Floating point precision and rounding can mean that x falls on the wrong side of an integer. In other words, the true value of the expression that yields x is greater than some integer, i, but that x when evaluated using floating point arithmetic is less than i.
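A classic illustration of that effect (the exact printed digits may vary slightly, but the floor result is the point):

#include <cmath>
#include <cstdio>

int main()
{
    // 0.1 and 0.7 have no exact binary representation; their sum evaluates
    // to slightly less than 0.8, so the product lands just below 8.
    double x = (0.1 + 0.7) * 10.0;

    std::printf("%.17g\n", x);          // 7.9999999999999991 (or very close to it)
    std::printf("%g\n", std::floor(x)); // 7, although the "true" value of the expression is 8
}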
Small integers are representable exactly as floats, but big integers are not.
But, as others pointed out, a big integer that is not exactly representable as a float will never be approximated by a non-integer value, so floor() will never return a non-integer. Thus, the cast to (int), as long as it does not overflow, will be correct.
But how small is small? Copying shamelessly from this answer:
For float, it is 16,777,217 (2^24 + 1).
For double, it is 9,007,199,254,740,993 (2^53 + 1).
Note that the usual range of int (32 bits) is 2^31, so float is unable to represent all of them exactly. Use double if you need that.
Interestingly, floats can store a certain range of integers exactly, for example:
1 is stored as mantissa 1 (binary 1) * exponent 2^0
2 is stored as mantissa 1 (binary 1) * exponent 2^1
3 is stored as mantissa 1.5 (binary 1.1) * exponent 2^1
4 is stored as mantissa 1 * exponent 2^2
5 is stored as mantissa 1.25 (binary 1.01) * exponent 2^2
6 is stored as mantissa 1.5 (binary 1.1) * exponent 2^2
7 is stored as mantissa 1.75 (binary 1.11) * exponent 2^2
8 is stored as mantissa 1 (binary 1) * exponent 2^3
9 is stored as mantissa 1.125 (binary 1.001) * exponent 2^3
10 is stored as mantissa 1.25 (binary 1.01) * exponent 2^3
...
As you can see, the way the exponent increases works together with the fractional values the mantissa can represent exactly.
You can get a good sense for this by putting numbers into this great online conversion site.
Once you cross a certain threshold, there aren't enough digits in the mantissa to cover the span of the increased exponent without skipping first every other integer value, then three out of every four, then 7 out of 8, and so on. For numbers beyond this threshold, the issue is not that they might differ from integer values by some tiny fractional amount; it's that all the representable values are integers, so not only can no fractional part be represented any more, but, as above, some of the integers can't be represented either.
You can observe this in the calculator by considering:
Sign  Exponent  Mantissa                 Decimal
0     10010110  11111111111111111111111  16777215
0     10010111  00000000000000000000000  16777216
0     10010111  00000000000000000000001  16777218
See how at this stage, the smallest possible increment of the mantissa is actually "worth 2" in terms of the decimal value represented?
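You can see the same thing directly in code (assuming IEEE-754 single precision, which is what practically all current hardware uses):

#include <cstdio>

int main()
{
    float f = 16777216.0f;           // 2^24: exactly representable
    std::printf("%.1f\n", f + 1.0f); // 16777216.0 -- 16777217 is not representable, it rounds back down
    std::printf("%.1f\n", f + 2.0f); // 16777218.0 -- the next representable float
}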
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
It's always exact. What floor is doing is effectively wiping out any '1's in the mantissa whose significance (their contribution to value) is fractional anyway.
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
No.

Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused about how to determine how many logical left shifts are needed when calculating the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand (the 23-bit fraction field) will be: 11100000000000000000000
However, I knew that the 'e' value would be three because I could see that there are three bits after the leading bit once the point is placed. Is there a more generic way to code this for the general case?
Again, this is for an int-to-float conversion, assuming that the integer is non-negative, non-zero, and not larger than what fits in the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0. (The Wikipedia article linked below has a diagram illustrating this layout.)
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e. 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
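You can confirm that interpretation with a quick bit copy (a side check, separate from the conversion routine below):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint32_t bits = 0x40000000u; // sign 0, exponent 0x80, significand 0
    float f;
    std::memcpy(&f, &bits, sizeof f); // reinterpret the bit pattern as a float
    std::printf("%f\n", f);           // 2.000000
}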
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position.  Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times.  The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts).  IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent.  Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return.  This is an evil hack.
    return *reinterpret_cast<float*>(&merged);
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
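For instance, on compilers that provide __builtin_clz (GCC and Clang do; this is a toolchain assumption, not part of the original answer), the loop can be replaced and the sign folded in up front. A sketch:

#include <cstdint>
#include <cstring>

// Same idea as uint_to_float above, but using __builtin_clz to find the
// leading 1 instead of a shift loop, and handling the sign bit as well.
// Assumes 0 < |value| < 2^24.
float int_to_float(int value)
{
    std::uint32_t sign = (value < 0) ? 0x80000000u : 0u;
    std::uint32_t mag  = (value < 0) ? static_cast<std::uint32_t>(-static_cast<std::int64_t>(value))
                                     : static_cast<std::uint32_t>(value);

    int shifts = __builtin_clz(mag) - 8;        // shifts needed to put the leading 1 at bit 23
    std::uint32_t exponent = 127 + 23 - shifts; // biased exponent
    std::uint32_t merged = sign | (exponent << 23) | ((mag << shifts) & 0x7FFFFFu);

    float result;
    std::memcpy(&result, &merged, sizeof result); // well-defined bit-for-bit reinterpretation
    return result;
}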
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is that you can't shove more than 24 bits of significance into a 24-bit field without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value were >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. The rounding mode just tells you how to handle the bits shifted away.

representation of double and radix point

According to what I know about double (the IEEE standard), there is one bit for the sign, 52 bits for the mantissa, a base, and some bits for the exponent.
The formula to get the value of the double is: (−1)^s × c × b^q
Maybe I made some mistakes, but that's the idea.
I'm just wondering how we can know where to put the radix point with this formula.
If I take a number, I get for instance:
c = 4
q = 3
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put the radix point.
What is wrong here?
EDIT:
Maybe q is always negative?
I guess a look at Wikipedia would have helped.
The thing is, there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normalized, which means that the encoded number is in the format:
(-1)^(sign) * '1.' (mantissa) * 2^(exponent)
Therefore, you have encoded 1.32, not 32.
32 = 1 * 2^5, so mantissa = 1, exponent = 5, sign = 0. We need to add the bias of 1023 when encoding the exponent, so below we have 1023 + 5 = 1028. We also need to drop the leading 1 when encoding the mantissa, so that 1.(whatever) becomes (whatever).
Hexadecimal representation of 32 as 64-bit double is 4040000000000000, or binary:
0 10000000100 0000000000000000000000000000000000000000000000000000
              ^--- start of mantissa (coded 0, interpreted as 1.0)
  ^---------^----- exponent bits (coded 1028, interpreted as 5)
^----------------- sign (0)
To verify the result, visit this page, enter 32 in the first field, and click either the Rounded or the Not Rounded button (it doesn't matter which).
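If you'd rather verify locally instead of using the online converter, a small sketch:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    double d = 32.0;
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits); // copy out the raw bit pattern of the double
    std::printf("%016llX\n", static_cast<unsigned long long>(bits)); // 4040000000000000
}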