Why is the result of a bitwise shift unrecoverable if there is a mathematical equivalent of the same operation? - bit-manipulation

Take for example the number 91. That number in binary is 1011011. If you shift that number to the right by 5 bits, you get 2 (10 in binary). According to a Google search, bit shifting to the left or right by a certain number of bits is the same as multiplying or dividing the number by 2 to the power of the number of bits to be shifted, respectively. So to get from 91 to 2 by bit shifting, the equation would look like this: 91 / 2^5, which is also 91 / 32. Now, of course, if you did that on your calculator, there would be some decimal digits, which aren't included when bit shifting. The resulting 2 is actually 2.84375. I'm sure you know that if you do a certain operation on a number and then you do the inverse, the result is what you had in the first place. So does decimal precision have something to do with this?

There is a mathematical equivalent of shifting to the right... and the mathematical operation is UNRECOVERABLE.
You seem to think that shifting to the right is:
bit shifting to the left or right by a certain amount of bits is the same as multiplying or dividing the number by 2
This is what you will hear people casually say, but it is only half right: it is not the same, only similar.
The correct statement is:
shifting a base-2 number one digit to the right is THE SAME as dividing by two in the integer domain
If you have an integer calculator and compute 91/32, you will get 2. You will not get ANY decimal point because we are operating in the integer domain.
For real numbers, the equivalent operation is:
FLOOR(91/32)
Which is also unrecoverable because it also results in 2.
The lesson here is to be careful when listening to what people CASUALLY say. Casual speech is often imprecise and assumes the listener is familiar with the subject. You need to dig deeper into what the statement is actually trying to say.
As for why it is unrecoverable? Division of integers gives two results: the quotient (which is the main result) and the remainder. When we divide 91 by 32 we are doing this:
       2
     ____
32 ) 91
     64
     __
     27
So we get the result of 2 and a remainder of 27. The reason you can't get 91 by multiplying 2*32 is because we threw away the remainder.
You can get the result back if you saved the remainder. However, calculating the remainder is not a matter of simple shifts. Here's an example of how to make it reversible in C:
int test () {
    int a = 91;
    int b = 32;
    int result;
    int remainder;
    result = a / b; // result will be 2
    remainder = a % b; // remainder will be 27
    return (result * b) + remainder; // returns 91
}
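For the special case where the divisor is a power of two, the part that a right shift throws away is just the low bits, so it can be saved with a mask. The following is a minimal sketch under that assumption; the function names `shift_keep_low` and `unshift` are made up for illustration and only unsigned 32-bit values are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Shift x right by n bits and save the n low bits that the shift discards. */
uint32_t shift_keep_low(uint32_t x, unsigned n, uint32_t *low_bits)
{
    assert(n < 32);                              /* shifting by >= 32 is undefined */
    *low_bits = x & ((UINT32_C(1) << n) - 1);    /* the "remainder" of x / 2^n */
    return x >> n;                               /* the "quotient" of x / 2^n */
}

/* Rebuild the original value from the shifted value and the saved low bits. */
uint32_t unshift(uint32_t shifted, unsigned n, uint32_t low_bits)
{
    assert(n < 32);
    return (shifted << n) | low_bits;
}
```

With 91 and a shift of 5, the saved low bits are 27, the same remainder as in the long division above, and `unshift` gives 91 back.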

You can only recover the result of an operation if it has a 1-1 mapping between the inputs and outputs, i.e. it has an inverse function. But not all mathematical functions have an inverse function.
For example, if f(x) = x >> n, where >> is the shift operator, then it is equivalent to
f(x) = ⌊x/2^n⌋
with ⌊ ⌋ being the floor function. Since there are many inputs that lead to the same output, the relationship isn't 1-1 and there can't be an inverse function for it. This formula holds for both the unsigned right shift and the signed (arithmetic) right shift:
91 >> 5 == floor(91.0/32.0) == 2
-91 >> 5 == floor(-91.0/32.0) == -3
Similarly, for an unsigned left shift function g(x) = x << n, the equivalent is
g(x) = (x * 2^n) mod 2^N
with N being the size in bits of x, because integer math in hardware, and unsigned integer math in C and many other languages, is reduced modulo 2^N due to the limit of register size and the use of two's complement. And it's clear that the modulo function also isn't invertible/recoverable. The signed left shift is almost the same with some small modifications.
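To make the "many inputs, one output" point concrete, here is a small sketch for 32-bit unsigned values. The helper names (`shr_u32`, `shl_as_mul_mod`, `count_preimages`) are invented for this example, and N = 32 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* f(x) = floor(x / 2^n) for unsigned x. */
uint32_t shr_u32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x >> n;
}

/* g(x) = (x * 2^n) mod 2^32, written out with 64-bit arithmetic. */
uint32_t shl_as_mul_mod(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint64_t wide = (uint64_t)x * (UINT64_C(1) << n);
    return (uint32_t)(wide % (UINT64_C(1) << 32));
}

/* Count how many inputs in [0, limit) the right shift maps to target. */
unsigned count_preimages(uint32_t target, unsigned n, uint32_t limit)
{
    unsigned count = 0;
    for (uint32_t x = 0; x < limit; ++x) {
        if (shr_u32(x, n) == target) {
            ++count;
        }
    }
    return count;
}
```

For a shift of 5, every value from 64 to 95 maps to 2, so there are 32 candidates for the original input. For the left shift, 1 and 0x80000001 both map to 2 when shifted by one bit.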


How to connect the theory of fixed-point numbers and its practical implementation?

The theory of fixed-point numbers is that we divide a certain number of bits between the integer part and the fractional part. This division is fixed.
For example, 26.5 is stored in that order:
To convert from floating-point to fixed-point, we follow this algorithm:
Calculate x = floating_input * 2^(fractional_bits)
27.3 * 2^10 = 27955.2
Round x to the nearest whole number (e.g. round(x))
27955
Store the rounded x in an integer container
Now if we look at the bit representation of our numbers and at what multiplying by 2^(fractional_bits) does, we will see:
27 is 11011
27*2^10 is 110 1100 0000 0000, which is a shift of 10 bits to the left.
So we can say that multiplying by 2^10 indeed gives us "space" in the right part of the bits for storing further changes to this number. We can take two numbers converted in this way, have them interact with each other, and eventually convert the result back to the familiar view with a point by the opposite operation, dividing by 2^10.
If we recall that the bits are stored in some integer variable, which in turn has its own number of bits, it becomes clear that the more bits in that variable are devoted to the fractional part, the fewer bits remain for the integer part of the number.
27.3 * 2^10 = 27955.2 should be rounded for storing in an integer type to
27955 which is 110 1101 0011 0011
After that the number can be altered somehow (the particular value isn't important now), and let's say we want to retrieve the human-readable value back:
27955/2^10 = 27.2998046875
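The round trip described here can be sketched in C as follows. The function names `to_fixed` and `from_fixed` are illustrative, the container is assumed to be a 32-bit signed integer, and no overflow checking is done.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a double to fixed point with frac_bits fractional bits,
 * rounding to the nearest integer (halves away from zero). */
int32_t to_fixed(double value, unsigned frac_bits)
{
    assert(frac_bits < 31);
    double scaled = value * (double)(INT32_C(1) << frac_bits);
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/* Convert the stored integer back to a human-readable double. */
double from_fixed(int32_t raw, unsigned frac_bits)
{
    assert(frac_bits < 31);
    return (double)raw / (double)(INT32_C(1) << frac_bits);
}
```

With 10 fractional bits, `to_fixed(27.3, 10)` yields 27955, and `from_fixed(27955, 10)` yields 27.2998046875. The small difference from 27.3 is the rounding error of the conversion.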
What about the number of bits after the point?
Let's say we have two numbers that we want to multiply, and we chose 10 bits after the point
27 * 3.3 = 89.1 expected
27*2^10 = 27 648 is 110 1100 0000 0000
3.3*2^10 = 3 379 is 1101 0011 0011
27 648 * 3 379 = 93 422 592
consequently
27*3.3 = 93 422 592/(2^10*2^10) = 89.09 pretty accurate
Let's take 1 bit after point
27 and 3.3
27*2^1 = 54 is 110110
3.3*2^1 = 6.6 after truncation 6 is 110
54 * 6 = 324
consequently
27*3.3 = 324/(2^1*2^1) = 81 which is unsatisfying
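The two worked examples can be reproduced with a short sketch. The function name `product_via_fixed` is invented here, and the conversion truncates toward zero (an assumption chosen so that 6.6 becomes 6, as in the 1-bit example).

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a and b by way of a fixed-point representation with
 * frac_bits fractional bits, then convert the product back to double. */
double product_via_fixed(double a, double b, unsigned frac_bits)
{
    assert(frac_bits < 16);
    int64_t scale = INT64_C(1) << frac_bits;
    int64_t fa = (int64_t)(a * (double)scale);   /* truncates toward zero */
    int64_t fb = (int64_t)(b * (double)scale);   /* truncates toward zero */
    /* The raw product carries 2 * frac_bits fractional bits. */
    return (double)(fa * fb) / (double)(scale * scale);
}
```

With 10 fractional bits, the result for 27 and 3.3 is 93422592 / 2^20, roughly 89.09. With 1 fractional bit it is 324 / 4 = 81.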
In practice we can use the following code to create and operate on fixed-point numbers:
#include <iostream>
using namespace std;
const int scale = 10;
// Macro arguments are parenthesized so that expressions expand safely.
#define DoubleToFixed(x) ((x) * (double)(1 << scale))
#define FixedToDouble(x) ((double)(x) / (double)(1 << scale))
#define IntToFixed(x) ((x) << scale)
#define FixedToInt(x) ((x) >> scale)
#define MUL(x,y) (((x) * (y)) >> scale)
// Pre-scale the dividend so the quotient keeps `scale` fractional bits.
#define DIV(x,y) (((x) << scale) / (y))
int main()
{
    double a = 7.27;
    double b = 3.0;
    int f = DoubleToFixed(a);
    cout << f << endl; //7444
    cout << FixedToDouble(f) << endl; //7.26953125
    int g = DoubleToFixed(b);
    cout << g << endl; //3072
    int c = MUL(f, g);
    cout << FixedToDouble(c) << endl; //21.80859375
}
So, where is the connection between the theory of a fixed placement of the point between bits (powers of 2) and the practical implementation? If we store a fixed-point number in an int, it is obvious that there is no place for storing the point in it.
It seems that fixed-point numbers are just a conversion to increase performance. And to retrieve a human-readable number after calculations, the opposite conversion must be performed.
I hope I understand the algorithm. But is the idea of placing the point between digits just an abstract idea?
Fixed-point formats are used as a way to represent fractional numbers. Quite commonly, processors perform fixed-point or integer arithmetic faster or more efficiently than floating-point arithmetic. Whether fixed-point arithmetic is suitable for an application depends on what numbers the application needs to work with.
Using fixed-point formats does require converting input to the fixed-point format and converting numbers in the fixed-point format to output. But this is also true of integers and floating-point: All input must be converted to whatever internal format is used to represent it, and all output must be produced by converting from internal formats.
And how does multiplying by 2^(fractional_bits) affect the quantity of digits after the point?
Suppose we have some number x that is represented as an integer X = x•2^f, where f is the number of fraction bits. Conceptually X is in a fixed-point format. Similarly, we have y represented as Y = y•2^f.
If we execute an integer multiplication instruction to produce result Z = XY, then Z = XY = (x•2^f)•(y•2^f) = xy•2^(2f). Then, if we divide Z by 2^f (or, nearly equivalently, shift it right by f bits), we have xy•2^f except for any rounding errors that may have occurred in the division. And xy•2^f is the fixed-point representation of the product of x and y.
Thus, we can effect a fixed-point multiplication by performing an integer multiplication followed by a shift.
Often, to get rounding instead of truncation, a value of half of 2^f is added before the shift, so we compute floor((XY + 2^(f−1)) / 2^f):
Multiply X by Y.
Add 2^(f−1).
Shift right f bits.
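The three steps above can be sketched in C as follows. The name `fx_mul_round` is illustrative, the raw values are assumed to fit in 32 bits with a 64-bit intermediate product, and for negative products the code relies on the right shift being arithmetic, which is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point multiply with rounding: floor((X*Y + 2^(f-1)) / 2^f). */
int32_t fx_mul_round(int32_t x_raw, int32_t y_raw, unsigned frac_bits)
{
    assert(frac_bits >= 1 && frac_bits < 31);
    int64_t wide = (int64_t)x_raw * (int64_t)y_raw;   /* step 1: multiply X by Y */
    wide += INT64_C(1) << (frac_bits - 1);            /* step 2: add 2^(f-1) */
    return (int32_t)(wide >> frac_bits);              /* step 3: shift right f bits */
}
```

For example, with f = 1 the raw values 3 and 1 stand for 1.5 and 0.5. Their exact product 0.75 lies halfway between the representable values 0.5 and 1.0, and the added half rounds it up to raw 2 (1.0), whereas a plain shift would give raw 1 (0.5).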
It seems that fixed-point numbers are just a conversion to increase performance.
You might as well say that floating-point numbers are a conversion to increase the representable range.
Whatever format your numbers are originally coming in as (strings, voltage levels, integers, etc.), you often convert them to floating point numbers in order to store or operate on them, but neither floating point nor fixed point is a human-readable representation.
For a given storage size, floating point numbers have lower precision and a wider magnitude range; fixed point numbers have higher precision and a narrower magnitude range. (Performance differences depend on the architecture and the important operations.) You shouldn't think of the fixed-point representation as a conversion from floating point, but as an alternative to floating point.
I think you want a class that wraps an int along with the fixed radix point information. Indeed, the use is implicit, but you then define your own multiplication (for example) that works on the fixed point meaning as a whole rather than just multiplying the underlying ints.
You don't want to leave the implicit meaning ... make it known to the compiler in a strong way. You should not have to explicitly call your handling functions; make it part of the class semantics.

What is the purpose of "int mask = ~0;"?

I saw the following line of code here in C.
int mask = ~0;
I have printed the value of mask in C and C++. It always prints -1.
So I do have some questions:
Why assigning value ~0 to the mask variable?
What is the purpose of ~0?
Can we use -1 instead of ~0?
It's a portable way to set all the binary bits in an integer to 1 bits without having to know how many bits are in the integer on the current architecture.
Before C23 and C++20, C and C++ allowed 3 different signed integer formats: sign-magnitude, one's complement and two's complement.
~0 will produce all-one bits regardless of the sign format the system uses, so it's more portable than -1.
You can add the U suffix (i.e. -1U) to generate an all-one bit pattern portably [1]. However, ~0 indicates the intention more clearly: invert all the bits in the value 0, whereas -1 suggests that a value of minus one is needed, not its binary representation.
[1] Because unsigned operations are always reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
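A short sketch of the unsigned variant, where the all-ones pattern is well defined regardless of the signed representation. The helper names `all_ones` and `low_bits_mask` are made up for this example, and `unsigned` is assumed to have no padding bits.

```c
#include <assert.h>
#include <limits.h>

/* All bits set, without knowing how wide unsigned is on this platform. */
unsigned all_ones(void)
{
    return ~0u;
}

/* Mask with only the n lowest bits set, derived from the all-ones value. */
unsigned low_bits_mask(unsigned n)
{
    unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    assert(n <= width);
    return n == 0 ? 0u : all_ones() >> (width - n);
}
```

Both `~0u` and `(unsigned)-1` equal `UINT_MAX`, and ANDing any value with the full mask leaves it unchanged.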
That on a 2's complement platform (that is assumed) gives you -1, but writing -1 directly is forbidden by the rules (only integers 0..255, unary !, ~ and binary &, ^, |, +, << and >> are allowed).
You are studying a coding challenge with a number of restrictions on operators and language constructions to perform given tasks.
The first problem is to return the value -1 without the use of the - operator.
On machines that represent negative numbers with two's complement, the value -1 is represented with all bits set to 1, so ~0 evaluates to -1:
/*
* minusOne - return a value of -1
* Legal ops: ! ~ & ^ | + << >>
* Max ops: 2
* Rating: 1
*/
int minusOne(void) {
// ~0 = 111...111 = -1
return ~0;
}
Other problems in the file are not always implemented correctly. The second problem, returning a boolean value indicating whether an int value would fit in a 16-bit signed short, has a flaw:
/*
* fitsShort - return 1 if x can be represented as a
* 16-bit, two's complement integer.
* Examples: fitsShort(33000) = 0, fitsShort(-32768) = 1
* Legal ops: ! ~ & ^ | + << >>
* Max ops: 8
* Rating: 1
*/
int fitsShort(int x) {
/*
* after left shift 16 and right shift 16, the left 16 of x is 00000..00 or 111...1111
* so after shift, if x remains the same, then it means that x can be represent as 16-bit
*/
return !(((x << 16) >> 16) ^ x);
}
Left shifting a negative value or a number whose shifted value is beyond the range of int has undefined behavior, and right shifting a negative value is implementation-defined, so the above solution is incorrect (although it is probably the expected solution).
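One way to sidestep the undefined left shift, offered only as a sketch and not as the challenge's intended answer, is to look at the bits from position 15 upward: x fits in 16 bits exactly when they are all zeros or all ones. The name `fitsShortNoLeftShift` is hypothetical, and the code still assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined.

```c
#include <assert.h>

/* Return 1 if x fits in a 16-bit two's complement integer, else 0.
 * Uses 5 operators: >> ! ~ ! | */
int fitsShortNoLeftShift(int x)
{
    int high;
    assert((-1 >> 1) == -1);      /* check the arithmetic-shift assumption */
    high = x >> 15;               /* 0 or -1 exactly when x fits */
    return !high | !(~high);
}
```

This gives 0 for 33000 and 1 for -32768, matching the examples in the problem statement.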
Loooong ago this was how you saved memory on extremely limited equipment such as the 1K ZX 80 or ZX 81 computer. In BASIC, you would
Let X = NOT PI
rather than
LET X = 0
Since numbers were stored as 4 byte floating points, the latter takes 2 bytes more than the first NOT PI alternative, where each of NOT and PI takes up a single byte.
There are multiple ways of encoding numbers across computer architectures. With 2's complement, ~0 == -1 is always true. On the other hand, some computers use 1's complement for encoding negative numbers, for which this does not hold, because ~0 == -0. Yes, 1's complement has a negative zero, which is why it is not very intuitive.
So, to your questions:
~0 is assigned to mask so that all the bits in mask are equal to 1, which makes mask & sth == sth
~0 is used to make all bits equal to 1 regardless of the platform used
you can use -1 instead of ~0 if you are sure that your computer platform uses 2's complement number encoding
My personal thought: make your code as platform-independent as you can. The cost is relatively small and the code becomes more robust.

Counting the number of bits required to represent an integer in 2's complement

I have to write a function that counts the number of bits required to represent an int in 2's complement form. The requirements:
1. can only use: ! ~ & ^ | + << >>
2. no loops or conditional statements
3. at most 90 operators
Currently, I am thinking of something like this:
int howManyBits(int x) {
int mostdigit1 = !!(0x80000000 & x);
int mostdigit2 = mostdigit1 | !!(0x40000000 & x);
int mostdigit3 = mostdigit2 | !!(0x20000000 & x);
//and so one until it reach the least significant digit
return mostdigit1+mostdigit2+...+mostdigit32+1;
}
However, this algorithm doesn't work, and it also exceeds the 90-operator limit. Any suggestions on how I can fix and improve it?
With 2's complement integers, the problem is the negative numbers. A negative number is indicated by the most significant bit: if it is set, the number is negative.
The negative of a 2's complement integer n is (1's complement of n) + 1, i.e. -n == ~n + 1.
Thus, I would first test the sign bit. If it is set, the number of bits required is simply the number of bits available to represent an integer, e.g. 32 bits. If not, you can count the number of bits required by repeatedly shifting n right by one bit until the result is zero. For example, if n is +1, i.e. 000…001, you have to shift it right once to make the result zero, so you need 1 bit to represent it.
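The repeated shifting described above needs a loop, which the question's rules do not allow. A loop-free alternative, which is a different technique from the one in this answer, is a binary search for the highest significant bit. The sketch below assumes a 32-bit int and an arithmetic right shift for negative values. It also assumes the usual convention for this exercise that the sign bit is counted and leading sign bits are redundant, so 0 and -1 need 1 bit; the question does not state expected values.

```c
#include <assert.h>

/* Minimum number of bits needed to represent x in two's complement. */
int howManyBits(int x)
{
    int sign, b16, b8, b4, b2, b1, b0;
    assert(sizeof(int) == 4 && (-1 >> 1) == -1);   /* stated assumptions */
    sign = x >> 31;              /* 0 for x >= 0, all ones for x < 0 */
    x = x ^ sign;                /* flip negatives so we search for the top 1 bit */
    b16 = !!(x >> 16) << 4;      /* 16 if anything is set in the upper half */
    x = x >> b16;
    b8 = !!(x >> 8) << 3;
    x = x >> b8;
    b4 = !!(x >> 4) << 2;
    x = x >> b4;
    b2 = !!(x >> 2) << 1;
    x = x >> b2;
    b1 = !!(x >> 1);
    x = x >> b1;
    b0 = x;                      /* 1 if a single bit is left, 0 if x was zero */
    return b16 + b8 + b4 + b2 + b1 + b0 + 1;   /* +1 for the sign bit */
}
```

This uses only the permitted operators and stays well under the 90-operator limit. For example, 12 needs 5 bits (01100) and -5 needs 4 bits (1011).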

Why perform multiplication in this way?

I've run into this function:
static inline INT32 MPY48SR(INT16 o16, INT32 o32)
{
UINT32 Temp0;
INT32 Temp1;
// A1. get the lower 16 bits of the 32-bit param
// A2. multiply them with the 16-bit param
// A3. add 16384 (TODO: why?)
// A4. bitshift to the right by 15 (TODO: why 15?)
Temp0 = (((UINT16)o32 * o16) + 0x4000) >> 15;
// B1. Get the higher 16 bits of the 32-bit param
// B2. Multiply them with the 16-bit param
Temp1 = (INT16)(o32 >> 16) * o16;
// 1. Shift B to the left (TODO: why do this?)
// 2. Combine with A and return
return (Temp1 << 1) + Temp0;
}
The inline comments are mine. It seems that all it's doing is multiplying the two arguments. Is this right, or is there more to it? Why would this be done in such a way?
Those parameters don't represent integers. They represent real numbers in fixed-point format with 15 bits to the right of the radix point. For instance, 1.0 is represented by 1 << 15 = 0x8000, 0.5 is 0x4000, -0.5 is 0xC000 (or 0xFFFFC000 in 32 bits).
Adding fixed-point numbers is simple, because you can just add their integer representation. But if you want to multiply, you first have to multiply them as integers, but then you have twice as many bits to the right of the radix point, so you have to discard the excess by shifting. For instance, if you want to multiply 0.5 by itself in 32-bit format, you multiply 0x00004000 (1 << 14) by itself to get 0x10000000 (1 << 28), then shift right by 15 bits to get 0x00002000 (1 << 13). To get better accuracy, when you discard the lowest 15-bits, you want to round to the nearest number, not round down. You can do this by adding 0x4000 = 1 << 14. Then if the discarded 15 bits is less than 0x4000, it gets rounded down, and if it's 0x4000 or more, it gets rounded up.
(0x3FFF + 0x4000) >> 15 = 0x7FFF >> 15 = 0
(0x4000 + 0x4000) >> 15 = 0x8000 >> 15 = 1
To sum up, you can do the multiplication like this:
return (o32 * o16 + 0x4000) >> 15;
But there's a problem. In C++, the result of a multiplication has the same type as its operands (after the usual conversions). So o16 is promoted to the same size as o32, then they are multiplied to get a 32-bit result. But this throws away the top bits, because the product needs 16 + 32 = 48 bits for accurate representation. One way to avoid this is to cast the operands to 64 bits and then multiply, but that might be slower, and it's not supported on all machines. So instead the function breaks o32 into two 16-bit pieces, does two multiplications in 32 bits, and combines the results.
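For comparison, the 64-bit alternative mentioned above might look like the following sketch. The name `mul_q15_wide` is hypothetical, standard `<stdint.h>` types stand in for the INT16/INT32 typedefs of the original, and the code assumes an arithmetic right shift for negative values (implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Q15 multiply using a 64-bit intermediate: (o32 * o16 + 2^14) >> 15. */
int32_t mul_q15_wide(int16_t o16, int32_t o32)
{
    int64_t product;
    assert((-1 >> 1) == -1);                  /* check the arithmetic-shift assumption */
    product = (int64_t)o32 * (int64_t)o16;    /* needs up to 48 bits */
    return (int32_t)((product + 0x4000) >> 15);
}
```

Multiplying 0.5 (0x4000) by 0.5 (0x4000) gives 0x2000, i.e. 0.25, as in the worked example.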
This implements multiplication of fixed-point numbers. The numbers are viewed as being in the Q15 format (having 15 bits in the fractional part).
Mathematically, this function calculates (o16 * o32) / 2^15, rounded to the nearest integer (hence the 2^14 term, which represents 1/2 and is added in order to round). It uses unsigned and signed 16-bit multiplications with a 32-bit result, which are presumably supported by the instruction set.
Note that there exists a corner case, where each of the numbers has a minimal value (-2^15 and -2^31); in this case, the result (2^31) is not representable in the output, and gets wrapped over (becomes -2^31 instead). For all other combinations of o16 and o32, the result is correct.

Bit shifts with ABAP

I'm trying to port some Java code, which requires arithmetic and logical bit shifts, to ABAP.
As far as I know, ABAP only supports the bitwise NOT, AND, OR and XOR operations.
Does anyone know another way to implement these kind of shifts with ABAP? Is there perhaps a way to get the same result as the shifts, by using just the NOT, AND, OR and XOR operations?
Disclaimer: I am not specifically familiar with ABAP, hence this answer is given on a more general level.
Assuming that what you said is true (ABAP doesn't support shifts, which I somewhat doubt), you can use multiplications and divisions instead.
Logical shift left (LSHL)
Can be expressed in terms of multiplication:
x LSHL n = x * 2^n
For example given x=9, n=2:
9 LSHL 2 = 9 * 2^2 = 36
Logical shift right (LSHR)
Can be expressed with (truncating) division:
x LSHR n = x / 2^n
Given x=9, n=2:
9 LSHR 2 = 9 / 2^2 = 2.25 -> 2 (truncation)
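The two logical shifts can be checked against real shift operators with a small sketch. It is written in C rather than ABAP, only to illustrate the arithmetic; the function names are invented, and the values are treated as unsigned 32-bit integers so that the multiplication wraps modulo 2^32 like a hardware shift would.

```c
#include <assert.h>
#include <stdint.h>

/* 2^n computed by repeated multiplication, without any shift operator. */
uint32_t pow2_u32(unsigned n)
{
    uint32_t result = 1;
    assert(n < 32);
    while (n-- > 0) {
        result *= 2;
    }
    return result;
}

/* Logical shift left expressed as multiplication (wraps modulo 2^32). */
uint32_t lshl_by_mul(uint32_t x, unsigned n)
{
    return x * pow2_u32(n);
}

/* Logical shift right expressed as truncating division. */
uint32_t lshr_by_div(uint32_t x, unsigned n)
{
    return x / pow2_u32(n);
}
```

For x = 9 and n = 2 these return 36 and 2, matching the examples above. In a language without unsigned wrap-around, the multiplication result would additionally have to be reduced modulo 2^32.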
Arithmetic shift left (here: "ASHL")
If you wish to perform arithmetic shifts (=preserve sign), we need to further refine the expressions to preserve the sign bit.
Assuming we know that we are dealing with a 32-bit signed integer, where the highest bit is used to represent the sign:
x ASHL n = ((x AND (2^31-1)) * 2^n) + (x AND 2^31)
Example: Shifting Integer.MAX_VALUE to left by one in Java
As an example of how this works, let us consider that we want to shift Java's Integer.MAX_VALUE to left by one. Logical shift left can be represented as *2. Consider the following program:
int maxval = (int)(Integer.MAX_VALUE);
System.out.println("max value : 0" + Integer.toBinaryString(maxval));
System.out.println("sign bit : " + Integer.toBinaryString(maxval+1));
System.out.println("max val<<1: " + Integer.toBinaryString(maxval<<1));
System.out.println("max val*2 : " + Integer.toBinaryString(maxval*2));
The program's output:
max value : 01111111111111111111111111111111 (2147483647)
sign bit : 10000000000000000000000000000000 (-2147483648)
max val<<1: 11111111111111111111111111111110 (-2)
max val*2 : 11111111111111111111111111111110 (-2)
The result is negative because the highest bit of an integer is used to represent the sign. We get exactly -2 because of the way negative numbers are represented in Java (for details, see for instance http://www.javabeat.net/qna/30-negative-numbers-and-binary-representation-in/).
Edit: the updated code can now be found over here: github gist