Represent exponential expression in terms of bit operators - bit-manipulation

I'm trying to figure out how I would represent
x/2^y in terms of binary operators.
I know that
x >> 1 == x/2
So my gut says that, building on this, some further manipulation is needed to represent the general case. But I've been stuck for a while, so I thought I might get some guidance here.

Given that clarification, I think you want x >> y;
REVISED: The above only matches x/2^y for unsigned (or non-negative) integers. For negative signed values, an arithmetic right shift rounds toward negative infinity instead of toward zero.
I can't figure out how to make it round signed numbers toward zero without using addition. Using signed 32-bit integers, the following works:
((x+(x>>31))>>y)-(x>>31)
This works because x>>31 is -1 for negative numbers and zero for non-negative numbers.
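As a minimal sketch of the expression above, here it is wrapped in a function. The name div_pow2_trunc is made up for illustration, and the code assumes that >> on a negative signed value is an arithmetic shift (guaranteed since C++20, and what mainstream compilers do anyway). It also assumes 0 <= y <= 31 and x != INT32_MIN, since x + (x >> 31) would overflow for that one value.

```cpp
#include <cstdint>

// Hypothetical helper: x / 2^y, rounded toward zero, using only shifts and add/subtract.
// Assumes arithmetic right shift for negative values, 0 <= y <= 31, and x != INT32_MIN.
inline std::int32_t div_pow2_trunc(std::int32_t x, int y) {
    std::int32_t sign = x >> 31;      // -1 if x is negative, 0 otherwise
    return ((x + sign) >> y) - sign;  // for negative x: floor((x - 1) / 2^y) + 1, i.e. ceil(x / 2^y)
}
```

For non-negative x the sign term is zero and this reduces to the plain x >> y.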

Related

How to fix the position of binary point in an unsigned N-bit integer?

I am working on developing a fixed-point algorithm in C++. I know that an N-bit fixed-point binary number is written as U(a,b). For example, if an 8-bit integer (i.e. 256 values) is represented in the form U(6,2), the binary point sits to the left of the 2nd bit counting from the right, giving the form:
b5 b4 b3 b2 b1 b0 . b(-1) b(-2)
Thus, it has 6 integer bits and 2 fractional bits. In C++, I know there are some bit shift operators I can use, but they are basically used for shifting the bits of the input stream. My question is how to define a binary fixed-point integer of the form fix<6,2> or U(6,2). All the major processing operations will be carried out on the fractional part, and I am just looking for a way to do this fix in C++. Any help regarding this would be appreciated. Thanks!
Example: Suppose I have an input discrete signal with 1024 sample points on the x-axis (for now, just assume this input signal comes from some sensor). Each of these sample points has a particular amplitude. Say the sample at time 2 (x-axis) has an amplitude of 3.67 (y-axis). Now I have a variable "int *input;" that takes the sample index 2, which in binary is 0000 0010. So basically I want to turn this into 00000.010 by applying U(5,3) to sample 2 in C++, so that I can perform the interpolation operations on fractions of the input sampling period or time.
PS - I don't want to create a separate class or use external libraries for this. I just want to take each 8 bits from my input signal and apply the U(a,b) fix to it, after which the rest of the operations are done on the fractional part.
Short answer: left shift.
Long answer:
Fixed point numbers are stored as integers, usually int, which is the fastest integer type for a particular platform.
Ordinary integers without fractional bits are usually called Q0, Q.0 or QX.0, where X is the total number of bits of the underlying storage type (usually int).
To convert between different Q.X formats, left or right shift. For example, to convert 5 in Q0 to 5 in Q4, left shift it 4 bits, or multiply it by 16.
Usually it's useful to find or write a small fixed-point library that does basic calculations, like a*b>>q and (a<<q)/b, because you will do Q.X=Q.Y*Q.Z and Q.X=Q.Y/Q.Z a lot and you need to convert formats when doing calculations. As you may have observed, using the normal * operator will give you Q.(X+Y)=Q.X*Q.Y, so in order to fit the result into Q.Z format, you need to right shift the result by (X+Y-Z) bits.
Division is similar: you get Q.(X-Y)=Q.X/Q.Y from the standard / operator, and to get the result in Q.Z format you left shift the dividend by (Z-X+Y) bits before the division. What's different is that division is an expensive operation, and it's not trivial to write a fast one from scratch.
Be aware of the double-word support of your platform; it will make your life a lot easier. With double-word arithmetic, the result of a*b can be twice the size of a or b, so you don't lose range by doing a*b>>c. Without double-word support, you have to limit the input range of a and b so that a*b doesn't overflow. This is not obvious when you first start, but soon you will find you need more fractional bits or range to get the job done, and you will finally need to dig into the reference manual of your processor's ISA.
Example:
float a = 0.1f;                                    // 0.1
int aQ16 = (int)(a * 65536);                       // 0.1 in Q16 format = 6553 (truncated)
int bQ16 = 4 << 16;                                // 4 in Q16 format
int cQ16 = (int)(((long long)aQ16 * bQ16) >> 16);  // result = 26212 = 0.399963378906250 in Q16,
// not 26214 (0.4 in Q16), because of truncation error
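The a*b>>q and (a<<q)/b helpers described above could look roughly like the following sketch for Q16 values stored in 32-bit integers. The names q16_from_int, q16_mul, q16_div and kFracBits are made up for illustration. The intermediate results use a 64-bit type as the "double word", and the right shift of a negative intermediate is assumed to be arithmetic (guaranteed since C++20).

```cpp
#include <cstdint>

constexpr int kFracBits = 16;  // hypothetical constant: number of fractional bits (Q16)

// Convert a plain integer (Q0) to Q16. Multiplication is used instead of <<
// so that negative inputs are well defined in older C++ standards too.
constexpr std::int32_t q16_from_int(std::int32_t v) {
    return v * (1 << kFracBits);
}

// Q16 * Q16 gives Q32 in a 64-bit intermediate; shift right by 16 to get back to Q16.
constexpr std::int32_t q16_mul(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> kFracBits);
}

// Scale the dividend up to Q32 first, so that Q32 / Q16 yields Q16. b must be nonzero.
constexpr std::int32_t q16_div(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * (1LL << kFracBits)) / b);
}
```

With these, q16_mul(6553, q16_from_int(4)) reproduces the 26212 from the example above, and q16_div(q16_from_int(1), q16_from_int(4)) gives 16384, which is 0.25 in Q16.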
If this is your question:
Q. Should I define my fixed-binary-point integer as a template, U<int a, int b>(int number), or not, U(int a, int b)?
I think your answer to that is: "Do you want to define operators that take two fixed-binary-point integers? If so make them a template."
The template is just a little extra complexity if you're not defining operators. So I'd leave it out.
But if you are defining operators, you don't want to be able to add U<4, 4> and U<6, 2>. What would you define your result as? The templates will give you a compile time error should you try to do that.
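To illustrate the compile-time check, here is a minimal sketch of what such a template might look like for 8-bit storage. The struct layout, from_int and to_double are assumptions made for this example, not something taken from the question.

```cpp
#include <cstdint>

// Hypothetical U<A, B>: A integer bits and B fractional bits packed into 8 bits.
template <int A, int B>
struct U {
    static_assert(A + B == 8, "this sketch assumes 8-bit storage");
    std::uint8_t raw;  // stored value is the real value multiplied by 2^B

    // Converting an integer to U<A, B> is a left shift by B.
    static U from_int(unsigned v) { return U{static_cast<std::uint8_t>(v << B)}; }
    double to_double() const { return raw / static_cast<double>(1 << B); }
};

// Only defined when both operands have the same A and B, so adding
// U<4, 4> to U<6, 2> fails to compile. Overflow wraps modulo 256.
template <int A, int B>
U<A, B> operator+(U<A, B> lhs, U<A, B> rhs) {
    return U<A, B>{static_cast<std::uint8_t>(lhs.raw + rhs.raw)};
}
```

For example, U<6, 2>::from_int(3) + U<6, 2>{2} has raw value 14, which reads back as 3.5.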

How to reverse a number without using array or any arithmetic operations

How can I reverse a number without using arrays or any arithmetic operations, i.e. go from 85 to 58? Using bitwise operators might be the solution, but what series of operations is needed to reverse a number? I've tried shifting and complementing.
And is there a way to get binary or hexadecimal as input and perform operations on it, rather than reading an int and casting at printf?
There are a lot of APIs to do that, but the easiest way is to use a stack :D or convert the number to a string; then it behaves like an array and you can reverse it easily.
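The string approach could be sketched as follows. The function name reverse_digits is made up for illustration, and the standard library still does arithmetic internally for the conversions, so this only avoids arithmetic in your own code.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: reverse the decimal digits of n via a string round trip.
// Note: the reversed value may not fit in unsigned for large inputs
// (for example 1999999999), in which case the result is truncated or stoul throws.
unsigned reverse_digits(unsigned n) {
    std::string digits = std::to_string(n);      // 85 becomes "85"
    std::reverse(digits.begin(), digits.end());  // "85" becomes "58"
    return static_cast<unsigned>(std::stoul(digits));
}
```

Trailing zeros of the input become leading zeros and are dropped, so 120 turns into 21.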
I found the answer after all. Using the right shift 8 times and the left shift 2 times would work on a 2-digit number. Follow the same shift pattern for numbers with a different number of digits.

Conversion Big Integer <-> double in C++

I am writing my own long arithmetic library in C++ for fun and it is already pretty much finished. I even implemented several cryptographic algorithms with that library, but one important thing is still missing: I want to convert doubles (and floats/long doubles) into my number type and vice versa. My numbers are represented as a variable-sized array of unsigned long ints plus a sign bit.
I tried to find the answer with google, but the problem is that people rarely ever implement such things themselves, so I only find things about how to use Java BigInteger etc.
Conceptually, it is rather easy: I take the mantissa, shift it by the number of bits dictated by the exponent and set the sign. In the other direction I truncate it so that it fits into the mantissa and set the exponent depending on my log2 function.
But I am having a hard time figuring out the details. I could either play around with some bit patterns and cast the result to a double, but I didn't find an elegant way to achieve that, or I could "calculate" it by starting with 2, exponentiating, multiplying etc., but that doesn't seem very efficient.
I would appreciate a solution that doesn't use any library calls because I am trying to avoid libraries for my project; otherwise I could just have used gmp. Furthermore, on several other occasions I already have two solutions, one using inline assembler which is efficient and one that is more platform independent, so either kind of answer is useful for me.
edit: I use uint64_t for my parts, but I would like to be able to change it depending on the machine, but I am willing to do some different implementations with some #ifdefs to achieve that.
I'm going to make a non-portable assumption here: namely, that unsigned long long has more significant bits than the mantissa of a double. (This is true on all modern desktop systems that I know of.)
First, convert the most significant integer(s) into an unsigned long long. Then convert that to a double S. Let M be the number of remaining, less significant integers that were not used in that first step. Multiply S by (1ull << (sizeof(unsigned)*CHAR_BIT*M)). (If shifting by more than 63 bits, you will have to split that into separate shifts and do some arithmetic.) Finally, if the original number was negative, multiply the result by -1.
This rounds a lot, but even with this rounding, due to the above assumption, no digits are lost that wouldn't be lost anyway in the conversion to a double. I think this is a similar process to what Mark Ransom said, but I'm not certain.
For converting from a double to a biginteger, first separate the mantissa into a double M and the exponent into an int E, using frexp. Multiply M by UINT_MAX, and store that result in an unsigned R. If std::numeric_limits<double>::radix is 2 (I don't know if it is or not for x86/x64), you can easily shift R left by E-(sizeof(unsigned)*CHAR_BIT) bits and you're done. Otherwise the result will instead be R*(E**(sizeof(unsigned)*CHAR_BIT)) (where ** means to the power of).
If performance is a concern, you can add an overload to your bignum class for multiplying by std::integral_constant<unsigned, 10>, which simply returns (LHS<<3)+(LHS<<1). You can similarly optimize other constants if you wish.
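The shift-and-add trick for a constant multiplier rests on 10*x = 8*x + 2*x. A minimal sketch, using std::uint64_t as a stand-in for the bignum type (in a real bignum class the two shifts and the addition would be the class's own operations):

```cpp
#include <cstdint>

// Multiply by 10 using only shifts and one addition: (x << 3) is 8x, (x << 1) is 2x.
// uint64_t stands in for the bignum type here; results wrap on overflow.
constexpr std::uint64_t times10(std::uint64_t x) {
    return (x << 3) + (x << 1);
}

static_assert(times10(7) == 70, "8*7 + 2*7 == 70");
```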
This blog post might help you: Clarifying and optimizing Integer>>asFloat
Otherwise, you can still get an idea of the algorithm from this SO question: Converting from unsigned long long to float with round to nearest even
You don't say explicitly, but I assume your library is integer only and the unsigned longs are 32 bit and binary (not decimal). The conversion to double is simple, so I'll tackle that first.
Start with a multiplier for the current piece; if the number is positive it will be 1.0, if negative it will be -1.0. For each of the unsigned long ints in your bignum, multiply by the current multiplier and add it to the result, then multiply your multiplier by pow(2.0, 32) (4294967296.0) for 32 bits or pow(2.0, 64) (18446744073709551616.0) for 64 bits.
You can optimize this process by working with only the 2 most significant values. You need to use 2 even if the number of bits in your integer type is larger than the precision of a double, since the number of used bits in the most significant value might only be 1. You can generate the multiplier by taking a power of 2 to the number of skipped bits, e.g. pow(2.0, most_significant_count*sizeof(bit_array[0])*8). You can't use a bit shift as given in another answer because it will overflow after the first value.
To convert from double, you can get the exponent and mantissa separated from each other with the frexp function. The mantissa will come as a floating point value between 0.5 and 1.0 so you'll want to multiply it by pow(2.0, 32) or pow(2.0, 64) to convert it to an integer, then adjust the exponent by -32 or -64 to compensate.
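Both directions can be sketched as follows, assuming a hypothetical BigInt with 32-bit limbs stored least significant first plus a sign flag (the struct and function names are made up for illustration). to_double follows the multiplier loop described above. from_double is a variation rather than the exact recipe: it uses frexp only to work out how many limbs are needed, then peels off 32 bits at a time with ldexp, discarding any fractional part.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: 32-bit limbs, least significant first, plus a sign flag.
struct BigInt {
    std::vector<std::uint32_t> limbs;
    bool negative = false;
};

double to_double(const BigInt& n) {
    double multiplier = n.negative ? -1.0 : 1.0;
    double result = 0.0;
    for (std::uint32_t limb : n.limbs) {
        result += limb * multiplier;  // add this limb at its weight
        multiplier *= 4294967296.0;   // 2^32: weight of the next limb
    }
    return result;
}

BigInt from_double(double d) {
    BigInt n;
    double value = std::fabs(d);
    // Values below 1 truncate to zero; NaN and infinity are not handled in this sketch.
    if (!(value >= 1.0) || std::isinf(value)) return n;
    n.negative = d < 0.0;
    int exponent = 0;
    std::frexp(value, &exponent);  // value = m * 2^exponent with 0.5 <= m < 1
    std::size_t count = static_cast<std::size_t>((exponent + 31) / 32);
    n.limbs.assign(count, 0);
    for (std::size_t i = count; i-- > 0;) {
        int shift = 32 * static_cast<int>(i);
        double limb = std::floor(std::ldexp(value, -shift));  // top 32 bits still remaining
        n.limbs[i] = static_cast<std::uint32_t>(limb);
        value -= std::ldexp(limb, shift);  // exact: only removes the bits just stored
    }
    return n;
}
```

For example, the limbs {1, 2} represent 1 + 2 * 2^32 = 8589934593, and from_double(8589934593.0) yields the same two limbs. Summing from the least significant limb can round more than once for very large values, which is the reason the answer suggests working with only the two most significant limbs.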
To go from a big integer to a double, just do it the same way you parse numbers. For example, you parse the number "531" as "1 + (3 * 10) + (5 * 100)". Compute each portion using doubles, starting with the least significant portion.
To go from a double to a big integer, do it the same way but in reverse starting with the most significant portion. So, to convert 531, you first see that it's more than 100 but less than 1000. You find the first digit by dividing by 100. Then you subtract to get the remainder of 31. Then find the next digit by dividing by 10. And so on.
Of course, you won't be using tens (unless you store your big integers as digits). Exactly how you break it apart depends on how your big integer class is constructed. For example, if it uses 64-bit units, then you'll use powers of 2^64 instead of powers of 10.

multiplication of string [ containing integer], output also stored in string, How? [duplicate]

Possible Duplicates:
Inputting large numbers in c++?
Arbitrary-precision arithmetic Explanation
I need to multiply two huge huge integers, like:
a=1212121212121212121212121212121212121212121212121212;
b=1212121212121212121212121212121212121212121212121212;
I think there are no data types in C and C++ that can hold an integer this huge, so I thought I would keep it in string format, like:
const char *number1="1212121212121212121212121212121212121212121212121212";
const char *number2="1212121212121212121212121212121212121212121212121212";
At multiplication time I convert each string into an integer with the help of the atoi() function, like:
atoi(number1)*atoi(number2);
The output of this multiplication will obviously be huge as well, so I need to convert the output to string format.
I know there is an itoa() function which converts an integer to a string, but it is not compatible with all compilers. Can anybody tell me what I should do in this scenario?
I am using Ubuntu-10.04 and the g++ compiler.
Since C and C++ do not offer a native type that supports big numbers, it makes no sense to call atoi() to parse such numbers. atoi() returns a native int, which is capped at 2,147,483,647 on platforms where int is 32 bits wide (which includes most 64-bit platforms).
You can use one of the numerous bignum libraries, like GMP for instance.
I think the best option, besides using a math library, is to split those numbers into int arrays with some fixed limit per element. Then just perform the multiplication using the basic pen-and-paper method. And do not forget about overflows.
Multiplying large numbers is very difficult; however, we can do it by applying the formula for the logarithm of a product of two numbers. Here is how that formula is derived.
Let a, m and n be positive real numbers with a not equal to 1, which means a belongs to R+ – {1}. Let the logarithms of m and n to base a be x and y respectively, so that a^x = m and a^y = n. Then m·n = a^x · a^y = a^(x+y), and therefore
log_a(m·n) = x + y
As we already know, x = log_a(m) and y = log_a(n), so
log_a(m·n) = log_a(m) + log_a(n)
The logarithm of a product of two values is equal to the sum of the logarithms of those values. The same logarithmic identity can now help us multiply two large numbers by adding the logarithms of those values. If you don't have a calculator, just use a logarithm table to do this.
Using atoi() is also not helpful since the number itself won't fit in integer data type.
You have to simulate the method you did in elementary school.
   121
  * 23
  ----
   363
  242*
  ----
  2783
The implementation is left as an exercise. You would also need to know how to add big numbers.
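For readers who want to check their own attempt, here is one possible sketch of the schoolbook method on decimal strings. The function name multiply_decimal_strings is made up for illustration, and the inputs are assumed to be non-empty strings of the digits 0 to 9 with no sign. Each digit product is accumulated directly into the result array, so the separate big-number addition step is folded into the loop.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: schoolbook multiplication of two non-negative decimal strings.
// Assumes both inputs are non-empty and contain only the characters '0'..'9'.
std::string multiply_decimal_strings(const std::string& a, const std::string& b) {
    // The product of an n-digit and an m-digit number has at most n + m digits.
    std::vector<int> digits(a.size() + b.size(), 0);
    for (std::size_t i = a.size(); i-- > 0;) {
        for (std::size_t j = b.size(); j-- > 0;) {
            int product = (a[i] - '0') * (b[j] - '0') + digits[i + j + 1];
            digits[i + j + 1] = product % 10;  // digit that stays in this column
            digits[i + j] += product / 10;     // carry into the next column to the left
        }
    }
    std::string result;
    for (int d : digits) {
        if (result.empty() && d == 0) continue;  // skip leading zeros
        result.push_back(static_cast<char>('0' + d));
    }
    return result.empty() ? "0" : result;
}
```

Calling multiply_decimal_strings("121", "23") returns "2783", matching the worked example above.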

Bit manipulation for big integer classes?

I'm having a problem coming up with an algorithm for a big integer class in C++. My initial idea was using arrays/lists, but it's very inefficient. I then discovered about things like the following class:
http://www.codeproject.com/KB/cpp/CppIntegerClass.aspx
However, I find that approach really confusing. I don't know how to work with bit manipulation, and I barely understood the code. Could someone please explain to me how to use bit manipulation, how it works, etc.? Eventually I would like to create my own big integer class, but I'm barely a novice programmer and I have only just learned how to use classes.
Basically my question is:
How do I use bit manipulation to create a big integer class? How does it work??
Thanks!
Start by reading up on binary numbers in general. That page shows how the common arithmetic operations (addition, subtraction etc) work on binary numbers, i.e. how the numbers are manipulated bit by bit to get the desired result.
Mapping that into a programming language such as C++ should be pretty straight-forward once you know why there are bit-manipulating operations being used.
In my experience, the most obvious bit-oriented thing needed when implementing something like this is bit testing, to check for overflow. Let's say you represent your big binary number as an array of uint16_t, i.e. chunks of 16 bits. When implementing addition, you will start at the least significant end of both numbers, and add those. If the sum is larger than 65,535, you need to "carry" one to the next uint16_t, just as when you add decimal numbers one digit at a time.
This can be implemented with a test like so:
#include <stdint.h>

const uint16_t *number1;
const uint16_t *number2;
/* Assume code goes here to set up the number1 and number2 pointers. */
/* Add the lowest 16-bit chunks in a 32-bit variable, so bit 16 can hold the carry. */
uint16_t carry = 0;
uint32_t sum = (uint32_t)number1[0] + number2[0];
/* One way of testing for overflow: */
if (sum & (1u << 16))
    carry = 1;
Here, the 1 << 16 expression creates a mask by shifting a 1 sixteen steps to the left. The & (bitwise AND) operator tests the sum against the mask; the result will be non-zero (i.e. true, in C++) if bit 16 is set in sum.
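Extending that test to whole numbers, a full addition loop might look like the sketch below. It assumes the chunks are stored in a std::vector<std::uint16_t> with the least significant chunk first; the function name add_limbs is made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: add two big numbers stored as 16-bit chunks, least significant first.
std::vector<std::uint16_t> add_limbs(const std::vector<std::uint16_t>& a,
                                     const std::vector<std::uint16_t>& b) {
    std::vector<std::uint16_t> result;
    std::size_t n = std::max(a.size(), b.size());
    result.reserve(n + 1);
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t sum = carry;  // start with the carry from the previous chunk
        if (i < a.size()) sum += a[i];
        if (i < b.size()) sum += b[i];
        result.push_back(static_cast<std::uint16_t>(sum & 0xFFFFu));  // keep the low 16 bits
        carry = (sum & (1u << 16)) ? 1u : 0u;                         // bit 16 set means overflow
    }
    if (carry) result.push_back(1);  // the number grew by one chunk
    return result;
}
```

For example, adding {0xFFFF} and {1} gives {0, 1}: the low chunk wraps to zero and the carry becomes a new most significant chunk.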