bitmasking and binary arithmetic - c++

I'd like to know the science behind the following: a 32-bit value is shifted left 32 times into a 64-bit type, then a division is performed. Somehow the precision is contained within the last 32 bits, and to retrieve the value as a floating-point number I can multiply by 1 over the maximum value of an unsigned 32-bit int.
phase = ((uint64) 44100 << 32) / 48000;
(phase & 0xffffffff) * (1.0f / 4294967296.0f); // == 0.918749988
the same as
(float)44100 / 48000; // == 0.918749988

(...)
If you lose precision when dividing two integers, you should remember the remainder.
The remainder in C++ can be obtained with 44100 % 48000 in your case.
Actually these are constants, and it's completely clear that 44100 / 48000 == 0, so the remainder is all you have.
Well, the remainder will even be -- guess what -- 44100!
The float type (imposed by the explicit cast) has only about 6 significant decimal digits, so most values of this magnitude cannot be stored exactly. That's why this type isn't valuable for anything but playing around.
The best way to get a value of a fixed-size integer type in which all bits are set, without miscounting the number of 'f' digits in 0xfffff, is to use the ~ operator on 0. In your case, ~uint32_t(0).
Well, I should have said this in the beginning: 44100.0/48000 should give you the result you want. :P

This is the answer I was looking for:
Shifting left by N bits provides N bits in which to store the fractional part of a division.
Dividing the integer value represented by those bits by 2 to the power of the shift amount recovers the fractional value.
e.g.
0000 0001 * 2^8 = 1 0000 0000 = 256 (base 10)
1 0000 0000 / 2 = 1000 0000 = 128 (base 10)
128 / 2^8 = 0.5
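Putting the snippet from the question together as a runnable check (a minimal sketch; the variable names and the uint64_t cast are my own):

#include <cstdint>
#include <cstdio>

int main()
{
    // 32.32 fixed point: integer part in the high 32 bits,
    // fractional part in the low 32 bits.
    uint64_t phase = ((uint64_t)44100 << 32) / 48000;

    // Keep the low 32 bits and scale them back into [0, 1)
    // by dividing by 2^32.
    float frac = (phase & 0xffffffff) * (1.0f / 4294967296.0f);

    printf("%.9f\n", frac);                  // ~0.918749988
    printf("%.9f\n", (float)44100 / 48000);  // same value
}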


How type conversion is done in the following code?

In the code below, the variable Speed is of type int. How is it stored in two variables of char type? I also don't understand the comment // 16 bits - 2 x 8 bits variables.
Can you explain the type conversion with an example? When I run the code, it prints symbols after the conversion.
void AX12A::turn(unsigned char ID, bool SIDE, int Speed)
{
    if (SIDE == LEFT)
    {
        char Speed_H, Speed_L;
        Speed_H = Speed >> 8;
        Speed_L = Speed; // 16 bits - 2 x 8 bits variables
    }
}

int main()
{
    ax12a.turn(ID, LEFT, 200);
}
It seems that on your platform a variable of type int is stored in 16 bits and a variable of type char in 8 bits.
This is not guaranteed, as the C++ standard does not fix the size of these types; I made my assumption based on the code and the comment. Use data types of fixed size, such as the ones described here, to make sure this assumption is always going to be true.
Both int and char are integral types. When converting from a larger integral type to a smaller integral type (e.g. int to char), the most significant bits are discarded, and the least significant bits are kept (in this case, you keep the last 8 bits).
Before fully understanding the code, you also need to know about the right shift. It simply moves the bits to the right (for the purpose of this answer, it does not matter what is inserted on the left). The least significant (rightmost) bit is discarded and every other bit moves one position to the right, very much like dropping the last digit when dividing by 10 in the decimal system.
Now, you have your variable Speed, which has 16 bits.
Speed_H = Speed >> 8;
This shifts Speed right by 8 bits and assigns the 8 least significant bits of the result to Speed_H. In effect, Speed_H receives the 8 most significant bits (the "upper" half) of Speed.
Speed_L = Speed;
Simply assigns to Speed_L the least significant 8 bits.
The comment basically states that you split a variable of 16 bits into 2 variables of 8 bits, with the first (most significant) 8 bits being stored in Speed_H and the last (least significant) 8 bits being stored in Speed_L.
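To make the mechanics concrete, here is a self-contained version of the split and the reassembly (a minimal sketch using the fixed-size types recommended above; the names are mine):

#include <cstdint>
#include <cstdio>

int main()
{
    uint16_t speed = 200; // 0x00C8

    uint8_t speed_h = speed >> 8;   // most significant byte: 0x00
    uint8_t speed_l = speed & 0xFF; // least significant byte: 0xC8

    // Reassemble the two bytes to verify nothing was lost.
    uint16_t back = (uint16_t)((speed_h << 8) | speed_l);
    printf("%d %d %d\n", speed_h, speed_l, back); // prints: 0 200 200
}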
From your code I understand that sizeof(int) = 2 bytes in your case.
Let us take example as shown below.
int my_var = 200;
my_var is allocated 2 bytes of memory because its datatype is int.
The value assigned to my_var is 200.
Note that 200 decimal = 0x00C8 hexadecimal = 0000 0000 1100 1000 binary.
The higher byte 0000 0000 is stored at one of the addresses allocated to my_var,
and the lower byte 1100 1000 at the other, depending on endianness.
To know about endianness, check this link
https://www.geeksforgeeks.org/little-and-big-endian-mystery/
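As a quick illustration of endianness (a minimal sketch; it inspects the first stored byte of the example value through an unsigned char pointer, which is allowed to alias any object):

#include <cstdint>
#include <cstdio>

int main()
{
    uint16_t value = 0x00C8; // the example value 200

    // On a little-endian machine the low byte (0xC8) is stored first;
    // on a big-endian machine the high byte (0x00) comes first.
    uint8_t first_byte = *(uint8_t*)&value;
    printf("%s\n", first_byte == 0xC8 ? "little-endian" : "big-endian");
}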
In your code:
int Speed = 200;
Speed_H = Speed >> 8;
=> 200 decimal value right shifted 8 times
=> that means 0000 0000 1100 1000 binary value right shifted by 8 bits
=> that means Speed_H = 0000 0000 binary
Speed_L = Speed;
=> Speed_L = 200;
=> Speed_L = 0000 0000 1100 1000 binary
=> Speed_L is of type char, so it can accommodate only one byte
=> The value 0000 0000 1100 1000 is narrowed (in other words, cut off) to its least significant byte and assigned to Speed_L.
=> Speed_L = 1100 1000 binary = 200 decimal
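Regarding the symbols you see when printing: a char is printed as a character, not as a number, and on platforms where char is signed the value 200 typically wraps to a negative value. Casting before printing shows the numeric value (a minimal sketch):

#include <cstdio>

int main()
{
    char speed_l = 200; // implementation-defined when char is signed (typically becomes -56)

    printf("%c\n", speed_l);                // prints a symbol, not a number
    printf("%d\n", (unsigned char)speed_l); // prints 200
}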

fixed point subtraction for two's complement data

I have some real data, for example +2 and -3. They are represented in two's complement fixed point with 4-bit binary values, where the MSB is the sign bit and the number of fractional bits is zero.
So +2 = 0010
-3 = 1101
The addition of these two numbers is (+2) + (-3) = -1:
(0010) + (1101) = (1111)
But in the case of subtraction, (+2) - (-3), what should I do?
Do I need to take the two's complement of 1101 (-3) again and add it to 0010?
You can evaluate -(-3) in binary and then simply add it to the other value.
With two's complement, evaluating the opposite of a number is pretty simple: apply the NOT operation (written here with a tilde) to every bit, then add 1. For an n-bit number a (n = 4 in your example):
-a = ~a + 1
When the least significant bit is 1, this reduces to inverting every bit except that last one.
In your example (with an informal notation): -(-3) = -(1101) = 0010 + 1 = 0011
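A small sketch of the whole operation in C++, keeping the values in the low 4 bits of an unsigned type (the helper names neg4 and sub4 are mine):

#include <cstdint>
#include <cstdio>

// Negate a 4-bit two's complement value: -x = ~x + 1, masked to 4 bits.
static uint8_t neg4(uint8_t x) { return (~x + 1) & 0xF; }

// a - b is computed as a + (-b), again masked to 4 bits.
static uint8_t sub4(uint8_t a, uint8_t b) { return (a + neg4(b)) & 0xF; }

int main()
{
    uint8_t plus_two    = 0b0010; // +2
    uint8_t minus_three = 0b1101; // -3

    // prints 5 (0101), i.e. (+2) - (-3) = +5
    printf("%x\n", (unsigned)sub4(plus_two, minus_three));
}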

Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single-precision floating-point format using bitwise operators only. However, I'm confused about how to determine the number of logical left shifts needed when calculating the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 11100000000000000000000 (23 bits)
However, I knew that the 'e' value would be three because I was able to see that there are three bits after the leading bit once the point is placed. Is there a more generic way to code this for the general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0 (the Wikipedia article linked below has a diagram illustrating the layout).
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e. 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
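One can confirm this interpretation with a short check (a minimal sketch; it assumes IEEE-754 floats and uses memcpy, which side-steps the aliasing hack discussed further down):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t bits = 0x40000000;
    float f;
    std::memcpy(&f, &bits, sizeof f); // reinterpret the bit pattern as a float
    printf("%f\n", f);                // prints 2.000000
}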
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0; // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast<float*>(&merged);
}
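A quick way to exercise the function (a minimal sketch; it relies only on the uint_to_float definition above):

#include <cstdio>

int main()
{
    printf("%f\n", uint_to_float(15));    // expected: 15.000000 (the 1.111 x 2^3 example)
    printf("%f\n", uint_to_float(1));     // expected: 1.000000
    printf("%f\n", uint_to_float(12345)); // expected: 12345.000000
}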
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.

Why perform multiplication in this way?

I've run into this function:
static inline INT32 MPY48SR(INT16 o16, INT32 o32)
{
    UINT32 Temp0;
    INT32 Temp1;

    // A1. get the lower 16 bits of the 32-bit param
    // A2. multiply them with the 16-bit param
    // A3. add 16384 (TODO: why?)
    // A4. bitshift to the right by 15 (TODO: why 15?)
    Temp0 = (((UINT16)o32 * o16) + 0x4000) >> 15;

    // B1. get the higher 16 bits of the 32-bit param
    // B2. multiply them with the 16-bit param
    Temp1 = (INT16)(o32 >> 16) * o16;

    // 1. shift B to the left (TODO: why do this?)
    // 2. combine with A and return
    return (Temp1 << 1) + Temp0;
}
The inline comments are mine. It seems that all it's doing is multiplying the two arguments. Is this right, or is there more to it? Why would this be done in such a way?
Those parameters don't represent integers. They represent real numbers in fixed-point format with 15 bits to the right of the radix point. For instance, 1.0 is represented by 1 << 15 = 0x8000, 0.5 is 0x4000, -0.5 is 0xC000 (or 0xFFFFC000 in 32 bits).
Adding fixed-point numbers is simple, because you can just add their integer representations. But if you want to multiply, you first have to multiply them as integers, and then you have twice as many bits to the right of the radix point, so you have to discard the excess by shifting. For instance, if you want to multiply 0.5 by itself in 32-bit format, you multiply 0x00004000 (1 << 14) by itself to get 0x10000000 (1 << 28), then shift right by 15 bits to get 0x00002000 (1 << 13). To get better accuracy, when you discard the lowest 15 bits, you want to round to the nearest number, not round down. You can do this by adding 0x4000 = 1 << 14. Then if the discarded 15 bits are less than 0x4000, the result is rounded down, and if they are 0x4000 or more, it is rounded up:
(0x3FFF + 0x4000) >> 15 = 0x7FFF >> 15 = 0
(0x4000 + 0x4000) >> 15 = 0x8000 >> 15 = 1
To sum up, you can do the multiplication like this:
return (o32 * o16 + 0x4000) >> 15;
But there's a problem. In C++, the result of a multiplication has the same type as its operands. So o16 is promoted to the same size as o32, then they are multiplied to get a 32-bit result. But this throws away the top bits, because the product needs 16 + 32 = 48 bits for accurate representation. One way to do this is to cast the operands to 64 bits and then multiply, but that might be slower, and it's not supported on all machines. So instead it breaks o32 into two 16-bit pieces, then does two multiplications in 32-bits, and combines the results.
This implements multiplication of fixed-point numbers. The numbers are viewed as being in the Q15 format (having 15 bits in the fractional part).
Mathematically, this function calculates (o16 * o32) / 2^15, rounded to the nearest integer (hence the added 2^14 term, which represents 1/2 and turns the truncating shift into a rounding one). It uses unsigned and signed 16-bit multiplications with 32-bit results, which are presumably supported by the instruction set.
Note that there exists a corner case, where each of the numbers has a minimal value (-2^15 and -2^31); in this case, the result (2^31) is not representable in the output, and gets wrapped over (becomes -2^31 instead). For all other combinations of o16 and o32, the result is correct.
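One way to convince yourself is to compare the function against a straightforward 64-bit reference (a minimal sketch; it assumes the INT16/INT32 typedefs correspond to the <cstdint> fixed-width types):

#include <cstdint>
#include <cstdio>

// The behaviour MPY48SR reproduces without needing 64-bit arithmetic:
// widen, multiply, round (add 2^14), and shift out the extra 15
// fractional bits.
static int32_t mpy48sr_ref(int16_t o16, int32_t o32)
{
    return (int32_t)(((int64_t)o16 * o32 + 0x4000) >> 15);
}

int main()
{
    // 0.5 * 0.5 in Q15: 0x4000 * 0x4000 -> 0x2000, i.e. 0.25.
    printf("0x%x\n", (unsigned)mpy48sr_ref(0x4000, 0x4000)); // prints 0x2000
}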

representation of double and radix point

According to what I know about double (IEEE standard), there is one bit for the sign, 52 bits for the mantissa, a base, and some bits for the exponent.
the formula to get the double is : (−1)^s × c × b^q
Maybe I made some mistake but the idea is here.
I'm just wondering how we can know where to put the radix point with this formula.
If I take a number, I get for instance:
c = 4
q = 3
s = 2
b = 2
(-1)^2 * 4 * 2^3 = 32
but I don't know where to put the radix point.
What is wrong here?
EDIT:
Maybe q is always negative?
I guess a look at the Wikipedia would've helped.
The thing is, there is a "hidden" '1.' in the IEEE formula.
Every IEEE 754 number has to be normalized, which means that the encoded number has the format:
(-1)^sign * 1.mantissa * 2^exponent
Therefore, you have encoded 1.32, not 32.
32 = 1 * 2^5, so the significand is 1, exponent = 5, sign = 0. We need to add 1023 to the exponent when encoding it, so below we have 1023 + 5 = 1028. Also, we need to remove the leading digit 1 when encoding the mantissa, so that 1.(whatever) becomes (whatever).
Hexadecimal representation of 32 as 64-bit double is 4040000000000000, or binary:
0100 0000 0100 0000 0000 ... and zeros all the way down
^---------------------------- sign (0)
 ^^^-^^^^-^^^^--------------- exponent (coded 1028, interpreted 5)
               ^------------- start of mantissa (coded 0, interpreted 1.0)
To verify the result, enter 32 into an online IEEE-754 converter and click either the Rounded or the Not Rounded button (it doesn't matter which one).
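Alternatively, a short program can dump the bit pattern directly (a minimal sketch; it assumes a platform where double is IEEE-754 binary64):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    double d = 32.0;
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);           // copy out the representation
    printf("%016llx\n", (unsigned long long)bits); // prints 4040000000000000
}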