Emulated Fixed Point Division/Multiplication - C++

I'm writing a fixed-point class, but I've run into a bit of a snag: I'm not sure how to emulate the multiplication and division portions. I took a very rough stab at the division operator, but I'm sure it's wrong. Here's what it looks like so far:
class Fixed
{
    Fixed(short int _value, short int _part) :
        value(long(_value + (_part >> 8))), part(long(_part & 0x0000FFFF)) {};
    ...
    inline Fixed operator -() const // example of some of the bitwise it's doing
    {
        return Fixed(-value - 1, (~part) & 0x0000FFFF);
    };
    ...
    inline Fixed operator / (const Fixed & arg) const // example of how I'm probably doing it wrong
    {
        long int tempInt = value << 8 | part;
        long int tempPart = tempInt;
        tempInt /= arg.value << 8 | arg.part;
        tempPart %= arg.value << 8 | arg.part;
        return Fixed(tempInt, tempPart);
    };
    long int value, part; // members
};
I... am not a very good programmer, haha!
The class's part is 16 bits wide (but expressed as a 32-bit long, since I imagine it'd need the room for possible overflows before they're fixed), and the same goes for value, which is the integer part. When part goes over 0xFFFF in one of its operations, the highest 16 bits are added to value, and then part is masked so only its lowest 16 bits remain. That's done in the init list.
I hate to ask, but if anyone knows where I could find documentation for something like this, or even just the 'trick' to those two operators, I would be very happy! I am a dimwit when it comes to math, and I know someone must have done/asked this before, but for once searching Google has not taken me to the promised land...

As Jan says, use a single integer. Since it looks like you're specifying 16 bit integer and fractional parts, you could do this with a plain 32 bit integer.
The "trick" is to realise what happens to the "format" of the number when you do operations on it. Your format would be described as 16.16. When you add or subtract, the format stays the same. When you multiply, you get 32.32 -- So you need a 64 bit temporary value for the result. Then you do a >>16 shift to get down to 48.16 format, then take the bottom 32 bits to get your answer in 16.16.
I'm a little rusty on the division -- In DSP, where I learned this stuff, we avoided (expensive) division wherever possible!

I'd recommend using one integer value instead of separate whole and fractional parts. Then addition and subtraction are simply the integer operations, and you can use 64-bit support, which all common compilers have these days:
Multiplication:
Fixed operator * (const Fixed &other) const {
    // 16.16 * 16.16 yields 32.32; shift right 16 to land back in 16.16
    return Fixed(((int64_t)value * (int64_t)other.value) >> 16);
}
Division:
Fixed operator / (const Fixed &other) const {
    // pre-shift the dividend so the quotient lands back in 16.16
    return Fixed(((int64_t)value << 16) / (int64_t)other.value);
}
64-bit integers are available on all common toolchains, though the spelling differs:
On gcc, stdint.h (or cstdint, which places them in the std:: namespace) should be available, so you can use the types I mentioned above. Otherwise it's long long on 32-bit targets and long on 64-bit targets.
On Windows, it's always long long or __int64.

To get things up and running, first implement the (unary) inverse(x) = 1/x, and then implement a/b as a*inverse(b). You'll probably want to represent the intermediates in a 32.32 format.
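A sketch of that approach for raw 16.16 values (function names are illustrative; division by zero and overflow are not handled):

#include <cstdint>

int32_t fix_inverse(int32_t x)
{
    // 1.0 in 32.32 format is 1 << 32; dividing it by a 16.16 value yields 16.16
    return (int32_t)(((int64_t)1 << 32) / x);
}

int32_t fix_div(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)fix_inverse(b); // 32.32 intermediate
    return (int32_t)(wide >> 16);                        // back to 16.16
}

Note that going through the reciprocal loses precision compared to a single ((int64_t)a << 16) / b division.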

Related

Converting 2's complement of values back and forth in C++ by calculation or casting?

I get some values from hardware registers where values are stored as 16-bit unsigned integers, but these values are actually signed. Knowing the last bit (bit 15) is the sign bit, a colleague wrote the following snippet to convert them to two's complement values:
/* Take 15 bits of the data (last bit is the sign) */
#define DATAMASK 0x7FFF
/* Sign is bit 15 (starting from zero) with the 15 bit data */
#define SIGNMASK 0x8000
#define SIGNBIT  15

int16_t calc2sComplement(uint16_t data)
{
    int16_t temp, sign;
    int16_t signData;

    sign = (int16_t)((data & SIGNMASK) >> SIGNBIT);
    if (sign)
    {
        temp = (~data) & DATAMASK;
        signData = (short)(temp * -1);
    }
    else
    {
        temp = (data & DATAMASK);
        signData = temp;
    }
    return (signData);
}
As far as I know, unsigned integer types and signed integer types only differ in how the top bit is interpreted, so a cast like the following should work as well:
int16_t calc2sComplement(uint16_t data)
{
    return static_cast<int16_t>(data);
}
When values need to be pushed back to the hardware, the reverse operation is straightforward, unlike with the calculation. The advantage of the former solution is that it's toolchain-independent; since the toolchain can change sooner or later (gcc 4.4.7, and so C++03), I would prefer not to depend on it, so there won't be any regression when compiled years later. The advantage of the latter is that it's more readable, closer to the standard, and avoids unnecessary operations.
What would be the best choice in my case to be sure of keeping the same behaviour if compiled again after a toolchain change (even if the standard types are redefined somewhere in the toolchain, which I don't really have a hand in)?
If you would keep the first solution, how would you improve it and/or code the reverse conversion (keeping in mind that data can be a pointer over a buffer of data)?
In the end, let me answer my own question. The best way to convert values to or from two's complement, preventing any unexpected behaviour, is to perform the conversion as follows:
int16_t calc2sComplement(uint16_t data)
{
    return static_cast<int16_t>(data);
}
and to do the reverse operation :
uint16_t inv2sComplement(int16_t data)
{
    return static_cast<uint16_t>(data);
}
This method is proven to be completely safe (as long as primitive types are not redefined somewhere in the toolchain - which is considered bad practice, but was actually my case, hence my question in the first place), relying on the definition of the primitive built-in types.
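For the buffer case mentioned in the question, a minimal hypothetical wrapper (the name is mine) just applies the same cast element by element:

#include <cstdint>
#include <cstddef>

/* Convert 'count' raw register values to signed, writing into 'out'. */
void calc2sComplementBuffer(const uint16_t *in, int16_t *out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<int16_t>(in[i]);
}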

Print the integral value of a very long binary representation

Let's say you have a very long binary word (>64 bits) which represents an unsigned integral value, and you would like to print the actual number. We're talking C++, so let's assume you start off with a bool[], std::vector<bool>, or std::bitset, and end up with a std::string or some kind of std::ostream - whatever your solution prefers. But please only use the core language and STL.
Now, I suspected you'd have to evaluate it chunkwise, to get intermediate results that are small enough to store away - preferably base 10, as in x·10^k. I could figure out how to assemble the number from that point. But since there is no chunk width that corresponds to the base 10, I don't know how to do it. Of course, you could start with any other chunk width, let's say 3, to get intermediates of the form x·(2^3)^k, and then convert to base 10, but that leads to x·10^(3k·lg 2), which obviously has a floating-point exponent and isn't of any help.
Anyway, I'm exhausted by this math-crap, and I would appreciate a thoughtful suggestion.
Yours sincerely,
Armin
I'm going to assume you already have some sort of bignum division/modulo function to work with, because implementing such a thing is a complete nightmare.
#include <algorithm>
#include <iterator>
#include <ostream>
#include <string>

class bignum {
public:
    bignum(unsigned value = 0);
    bignum(const bignum& rhs);
    bignum(bignum&& rhs);
    void divide(const bignum& denominator, bignum& out_modulo);
    explicit operator bool();
    explicit operator unsigned();
};
std::ostream& operator<<(std::ostream& out, bignum value) {
    std::string backwards;
    bignum remainder;
    do {
        value.divide(10, remainder);
        backwards.push_back(char(unsigned(remainder) + '0')); // digit to ASCII
    } while (value);
    std::copy(backwards.rbegin(), backwards.rend(), std::ostream_iterator<char>(out));
    return out;
}
If rounding is an option, it should be fairly trivial to convert most bignums to double as well, which would be a LOT faster. Namely, copy the 64 most significant bits to an unsigned long, convert that to a double, and then multiply by 2.0 to the power of the number of significant bits minus 64. (I say significant bits because you have to skip any leading zeros.)
So if you have 150 significant bits, copy the top 64 into an unsigned long, convert that to a double, and multiply by std::pow(2.0, 150-64) ≈ 7.73e+25 to get the result. If you only have 40 significant bits, it still works: copy the 40 bits into the most significant bits of an unsigned long (padding with zeros on the right), convert that to a double, and multiply by std::pow(2.0, 40-64) ≈ 5.96e-8 to get the result!
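A sketch of that conversion, assuming top64 holds the leading 64 bits with the number's leading 1 at bit 63, and bits is the total significant-bit count (names are illustrative):

#include <cstdint>
#include <cmath>

double approx_to_double(uint64_t top64, int bits)
{
    // value ~= top64 * 2^(bits - 64), whether bits is above or below 64
    return (double)top64 * std::pow(2.0, bits - 64);
}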
Edit
Oli Charlesworth posted a link to the wikipedia page on Double Dabble which blows the first algorithm I showed out of the water. Don't I feel silly.

How does the compiler implement bit field arithmetics?

When asking a question about how to do wrapped N-bit signed subtraction, I got the following answer:
template<int bits>
int sub_wrap( int v, int s )
{
    struct Bits { signed int r : bits; } tmp;
    tmp.r = v - s;
    return tmp.r;
}
That's neat and all, but how will a compiler implement this? From this question I gather that accessing bit fields is more or less the same as doing it by hand, but what about when combined with arithmetic as in this example? Would it be as fast as a good manual bit-twiddling approach?
An answer for "gcc" in the role of "a compiler" would be great if anyone wants to get specific. I've tried reading the generated assembly, but it is currently beyond me.
As written in the other question, unsigned wrapping math can be done as:
int tmp = (a - b) & 0xFFF; /* 12 bit mask. */
Writing to a (12bit) bitfield will do exactly that, signed or unsigned. The only difference is that you might get a warning message from the compiler.
For reading though, you need to do something a bit different.
For unsigned maths, it's enough to do this:
int result = tmp; /* whatever bit count, we know tmp contains nothing else. */
or
int result = tmp & 0xFFF; /* 12bit, again, if we have other junk in tmp. */
For signed maths, the extra magic is the sign-extend:
int result = (tmp << (32-12)) >> (32-12); /* assuming 32-bit int, and 12-bit value. */
All that does is replicate the top bit of the bitfield (bit 11) across the wider int.
This is exactly what the compiler does for bitfields. Whether you code them by hand or as bitfields is up to you, but just make sure you get the magic numbers right.
(I have not read the standard, but I suspect that relying on bitfields to do the right thing on overflow might not be safe?)
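Putting the wrap and the sign-extend together, a manual equivalent of sub_wrap for 12 bits might look like the following sketch (it assumes a 32-bit int and an arithmetic right shift, as the code above does):

int sub_wrap12(int v, int s)
{
    int tmp = (v - s) & 0xFFF;              /* wrap to 12 bits */
    return (tmp << (32 - 12)) >> (32 - 12); /* sign-extend bit 11 */
}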
The compiler has knowledge about the size and exact position of r in your example. Suppose it is like
[xxxxrrrr]
Then
tmp.r = X;
could e.g. be expanded to (the b-suffix indicating binary literals, & is bitwise and, | is bitwise or)
tmp = (tmp & 11110000b)  // <-- get the remainder, which is not tmp.r
    | (X & 00001111b);   // <-- put X into tmp.r and filter away unwanted bits
Imagine your layout is
[xxrrrrxx] // 4 bits, 2 left-shifts
the expansion could be
tmp = (tmp & 11000011b)       // <-- get the remainder, which is not tmp.r
    | ((X << 2) & 00111100b); // <-- shift X left by 2, then keep the 4 relevant bits
What X actually looks like, whether a complex expression or just a literal, is irrelevant.
If your architecture does not support such bitwise operations, there are still multiplications and divisions by power of two to simulate shifting, and probably these can also be used to filter out unwanted bits.
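In actual C++, that second expansion could look like the following sketch (the function name is illustrative; the hex masks spell out the binary patterns above):

unsigned store_field(unsigned tmp, unsigned X)
{
    return (tmp & 0xC3u)        /* 11000011b: keep everything outside r   */
         | ((X << 2) & 0x3Cu);  /* 00111100b: shift X into place and mask */
}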

How to implement big int in C++

I'd like to implement a big int class in C++ as a programming exercise—a class that can handle numbers bigger than a long int. I know that there are several open source implementations out there already, but I'd like to write my own. I'm trying to get a feel for what the right approach is.
I understand that the general strategy is to get the number as a string, break it up into smaller numbers (single digits, for example), and place them in an array. At this point it should be relatively simple to implement the various comparison operators. My main concern is how to implement things like addition and multiplication.
I'm looking for a general approach and advice as opposed to actual working code.
A fun challenge. :)
I assume that you want integers of arbitrary length. I suggest the following approach:
Consider the binary nature of the datatype int. Think about using simple binary operations to emulate what the circuits in your CPU do when they add things. If you're interested in more depth, consider reading this Wikipedia article on half-adders and full-adders. You'll be doing something similar to that, and you can go down to as low a level as that - but being lazy, I thought I'd just forgo that and find an even simpler solution.
But before going into any algorithmic details about adding, subtracting, multiplying, let's find some data structure. A simple way, is of course, to store things in a std::vector.
template< class BaseType >
class BigInt
{
    typedef BaseType BT;
protected:
    std::vector< BaseType > value_;
};
You might want to consider whether to make the vector a fixed size and whether to preallocate it. The reason is that for various operations, you will have to go through each element of the vector - O(n). You might want to know offhand how complex an operation is going to be, and a fixed n does just that.
But now to some algorithms for operating on the numbers. You could do it at the logic level, but we'll use that magic CPU power to compute results. What we'll take over from the logic illustration of half- and full-adders is the way they deal with carries. As an example, consider how you'd implement the += operator. For each number in BigInt<>::value_, you'd add those and see if the result produces some form of carry. We won't be doing it bit-wise, but rely on the nature of our BaseType (be it long or int or short or whatever): it overflows.
Surely, if you add two (unsigned) numbers, the result must be at least as large as the larger of the two, right? If it's not, then the result overflowed.
template< class BaseType >
BigInt< BaseType >& BigInt< BaseType >::operator += (BigInt< BaseType > const& operand)
{
    BT carry = 0;
    std::size_t n = std::max(value_.size(), operand.value_.size());
    value_.resize(n, 0); // make room for the longer operand
    for (std::size_t count = 0; count < n; count++)
    {
        BT op0 = value_.at(count),
           op1 = count < operand.value_.size() ? operand.value_.at(count) : 0;
        BT sum = op0 + op1;          // wraps around on overflow...
        BT c1 = sum < op0;           // ...which we detect by comparing
        BT digits_result = sum + carry;
        BT c2 = digits_result < sum; // adding the carry-in can overflow too
        carry = c1 | c2;             // at most one of the two can happen
        value_.at(count) = digits_result;
    }
    if (carry)
        value_.push_back(carry); // the result gained a digit
    return *this;
}
// NOTE 1: BaseType must be an unsigned type, so that the wrap-around on
//         overflow is well-defined and the comparison trick works.
//         Alternatively, restrict BaseType to the second biggest type
//         available (i.e. a 32-bit int when you have a 64-bit long), use
//         a temporary or a cast to the mightier type, and retrieve the
//         upper bits as the carry. Or you do it bitwise. ;-)
The other arithmetic operations work analogously. Heck, you could even use the STL functors std::plus, std::minus, std::multiplies, std::divides, ..., but mind the carry. :) You can also implement multiplication and division using your plus and minus operators, but that's very slow, because you'd recalculate results you already calculated in prior calls to plus and minus in each iteration. There are a lot of good algorithms out there for this simple task; use Wikipedia or the web.
And of course, you should implement standard operators such as operator<< (just shift each value in value_ to the left by n bits, starting at value_.size()-1... oh, and remember the carry :), and operator< - you can even optimize a little here by first checking the rough number of digits with size(). And so on. Then make your class useful by befriending std::ostream& operator<<.
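For multiplication specifically, here is a sketch of the schoolbook algorithm over base-2^32 digits (assuming digits are stored as uint32_t, least significant first - an illustration, not this exact class):

#include <cstdint>
#include <vector>

std::vector<uint32_t> schoolbook_mul(const std::vector<uint32_t>& a,
                                     const std::vector<uint32_t>& b)
{
    std::vector<uint32_t> r(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j)
        {
            // 32x32 -> 64-bit product, plus the digit already there, plus
            // the carry; the total always fits in 64 bits
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t; // low 32 bits become the digit
            carry = t >> 32;        // high 32 bits carry into the next column
        }
        r[i + b.size()] += (uint32_t)carry;
    }
    while (r.size() > 1 && r.back() == 0)
        r.pop_back(); // trim leading zero digits
    return r;
}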
Hope this approach is helpful!
Things to consider for a big int class:

Mathematical operators: +, -, /, *, %. Don't forget that your class may be on either side of the operator, that the operators can be chained, and that one of the operands could be an int, float, double, etc.

I/O operators: >>, <<. This is where you figure out how to properly create your class from user input, and how to format it for output as well.

Conversions/Casts: Figure out what types/classes your big int class should be convertible to, and how to properly handle the conversion. A quick list would include double and float, and may include int (with proper bounds checking) and complex (assuming it can handle the range).
There's a complete section on this: [The Art of Computer Programming, vol.2: Seminumerical Algorithms, section 4.3 Multiple Precision Arithmetic, pp. 265-318 (ed.3)]. You may find other interesting material in Chapter 4, Arithmetic.
If you really don't want to look at another implementation, have you considered what it is you are out to learn? There are innumerable mistakes to be made and uncovering those is instructive and also dangerous. There are also challenges in identifying important computational economies and having appropriate storage structures for avoiding serious performance problems.
A challenge question for you: how do you intend to test your implementation, and how do you propose to demonstrate that its arithmetic is correct?
You might want another implementation to test against (without looking at how it does it), but it will take more than that to be able to generalize without expecting an excruciating level of testing. Don't forget to consider failure modes (out-of-memory problems, out of stack, running too long, etc.).
Have fun!
Addition would probably have to be done with the standard linear-time algorithm,
but for multiplication you could try http://en.wikipedia.org/wiki/Karatsuba_algorithm
Once you have the digits of the number in an array, you can do addition and multiplication exactly as you would do them longhand.
Don't forget that you don't need to restrict yourself to 0-9 as digits; i.e., use bytes as digits (0-255) and you can still do longhand arithmetic the same as you would for decimal digits. You could even use an array of long.
I'm not convinced that using a string is the right way to go - though I've never written such code myself, I think that using an array of a base numeric type might be a better solution. The idea is that you'd simply extend what you've already got, the same way the CPU extends a single bit into an integer.
For example, if you have a structure
typedef struct {
    int high, low;
} BiggerInt;
You can then manually perform native operations on each of the "digits" (high and low, in this case), being mindful of overflow conditions:
#include <limits.h> /* for INT_MAX */

BiggerInt add( const BiggerInt *lhs, const BiggerInt *rhs ) {
    BiggerInt ret;

    /* Ideally, you'd want a better way to check for overflow conditions */
    if ( rhs->high > INT_MAX - lhs->high ) {
        /* Overflow in the high part: with a variable-length (a real) BigInt,
           you'd allocate some more room here */
    }
    ret.high = lhs->high + rhs->high;

    if ( rhs->low <= INT_MAX - lhs->low ) {
        /* No overflow */
        ret.low = lhs->low + rhs->low;
    }
    else {
        /* Overflow */
        ret.high += 1;
        ret.low = lhs->low - ( INT_MAX - rhs->low ); /* Right? */
    }
    return ret;
}
It's a bit of a simplistic example, but it should be fairly obvious how to extend it to a structure that has a variable number of whatever base numeric type you're using.
Use the algorithms you learned in 1st through 4th grade.
Start with the ones column, then the tens, and so forth.
Like others said, do it the old-fashioned longhand way, but stay away from doing it all in base 10. I'd suggest doing it all in base 65536 and storing things in an array of longs.
If your target architecture supports BCD (binary-coded decimal) representation of numbers, you can get some hardware support for the longhand multiplication/addition that you need to do. Getting the compiler to emit BCD instructions is something you'll have to read up on...
The Motorola 68K series chips had this. Not that I'm bitter or anything.
My start would be to have an arbitrary-sized array of integers, using 31 bits and the 32nd as overflow.
The starter op would be ADD, and then MAKE-NEGATIVE using 2's complement. After that, subtraction flows trivially, and once you have add/sub, everything else is doable.
There are probably more sophisticated approaches. But this would be the naive approach from digital logic.
Could try implementing something like this:
http://www.docjar.org/html/api/java/math/BigInteger.java.html
You'd only need 4 bits for a single digit 0-9, so an int value would allow up to 8 digits each. I decided I'd stick with an array of chars, so I use double the memory, but for me it's only being used once.
Also, storing all the digits in a single int over-complicates things and, if anything, may even slow it down.
I don't have any speed tests, but looking at the Java version of BigInteger, it seems like it's doing an awful lot of work.
For me I do the below
//Number = 100,000.00, Number Digits = 32, Decimal Digits = 2.
BigDecimal *decimal = new BigDecimal("100000.00", 32, 2);
*decimal += "1000.99";
cout << decimal->GetValue(0x1 | 0x2) << endl; //Format and show decimals.
//Prints: 101,000.99
The computer hardware provides the facility of storing integers and doing basic arithmetic on them; generally this is limited to integers in a range (e.g. up to 2^64 - 1). But larger integers can be supported via programs; below is one such method.
Using a positional numeral system (e.g. the popular base-10 numeral system), any arbitrarily large integer can be represented as a sequence of digits in a base B. So such integers can be stored as an array of 32-bit integers, where each array element is a digit in base B = 2^32.
We already know how to represent integers using the numeral system with base B = 10, and also how to perform basic arithmetic (add, subtract, multiply, divide, etc.) within this system. The algorithms for doing these operations are sometimes known as schoolbook algorithms. We can apply these schoolbook algorithms (with some adjustments) to any base B, and so can implement the same operations for our large integers in base B.
To apply these algorithms to any base B, we will need to understand them further and handle concerns like:
what the range of the various intermediate values produced during these algorithms is;
what the maximum carry produced by the iterative addition and multiplication is;
how to estimate the next quotient digit in long division (see the sketch below).
(Of course, there can be alternate algorithms for doing these operations).
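For the quotient-digit concern, here is a sketch of the classic estimate from Knuth's Algorithm D: with a normalized divisor (top bit of its leading digit set), dividing the remainder's top two digits by the divisor's top digit overestimates the true digit by at most 2 (names here are illustrative):

#include <cstdint>

uint32_t estimate_qhat(uint32_t rem_hi, uint32_t rem_lo, uint32_t div_hi)
{
    uint64_t top = ((uint64_t)rem_hi << 32) | rem_lo; // top two remainder digits
    uint64_t qhat = top / div_hi;                     // candidate digit, possibly 1-2 too big
    return qhat > 0xFFFFFFFFu ? 0xFFFFFFFFu : (uint32_t)qhat;
}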
Some algorithm/implementation details can be found here (initial chapters), here (written by me) and here.
Subtract 48 from each character of your string of digits to get the numeric value of each large-number digit,
then perform the basic mathematical operations longhand on those digits.

Converting floating point to fixed point

In C++, what's the generic way to convert any floating point value (float) to fixed point (int, 16:16 or 24:8)?
EDIT: For clarification, fixed-point values have two parts: an integer part and a fractional part. The integer part can be represented by a signed or unsigned integer data type. The fractional part is represented by an unsigned integer data type.
Let's make an analogy with money for the sake of clarity. The fractional part may represent cents - a fractional part of a dollar. The range of the 'cents' data type would be 0 to 99. If an 8-bit unsigned integer were used for fixed-point math, then the fractional part would be split into 256 equal parts.
I hope that clears things up.
Here you go:
#include <limits>

// A signed fixed-point 16:16 class
class FixedPoint_16_16
{
    short intPart;
    unsigned short fracPart;
public:
    FixedPoint_16_16(double d)
    {
        *this = d; // calls operator=
    }
    FixedPoint_16_16& operator=(double d)
    {
        intPart = static_cast<short>(d);
        // scale the fractional remainder into the full 16-bit range
        // (negative values would need extra care)
        fracPart = static_cast<unsigned short>(
            (std::numeric_limits<unsigned short>::max() + 1.0) * (d - intPart));
        return *this;
    }
    // Other operators can be defined here
};
EDIT: Here's a more general class based on another common way to deal with fixed-point numbers (and which KPexEA pointed out):
template <class BaseType, size_t FracDigits>
class fixed_point
{
    const static BaseType factor = 1 << FracDigits;
    BaseType data;
public:
    fixed_point(double d)
    {
        *this = d; // calls operator=
    }
    fixed_point& operator=(double d)
    {
        data = static_cast<BaseType>(d * factor);
        return *this;
    }
    BaseType raw_data() const
    {
        return data;
    }
    // Other operators can be defined here
};

fixed_point<int, 8> fp1(0.0);           // Will be signed 24:8 (if int is 32-bits)
fixed_point<unsigned int, 16> fp2(0.0); // Will be unsigned 16:16 (if int is 32-bits)
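A hypothetical round trip with this class converts back by dividing by the same factor:

fixed_point<int, 16> x(3.25);
double back = x.raw_data() / double(1 << 16); // 3.25 again (within precision)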
A cast from float to integer will throw away the fractional portion, so if you want to keep that fraction around as fixed point, you just multiply the float before casting it. The code below does not check for overflow, mind you.
If you want 16:16
double f = 1.2345;
int n;
n=(int)(f*65536);
if you want 24:8
double f = 1.2345;
int n;
n=(int)(f*256);
Edit: My first comment applies to the state before Kevin's edit, but I'll leave it here for posterity. Answers change so quickly here sometimes!
The problem with Kevin's approach is that with fixed point you are normally packing into a guaranteed word size (typically 32 bits). Declaring the two parts separately leaves you at the whim of your compiler's structure packing. Yes, you could force it, but it does not work for anything other than the 16:16 representation.
KPexEA is closer to the mark by packing everything into an int - although I would use "signed long" to try to be explicit about 32 bits. Then you can use his approach for generating the fixed-point value, and bit slicing to extract the component parts again. His suggestion also covers the 24:8 case.
( And everyone else who suggested just static_cast.....what were you thinking? ;) )
I gave the answer to the guy that wrote the best answer, but I really used a related question's code, which points here.
It used templates and made it easy to ditch the dependencies on the boost lib.
This is fine for converting from floating point to integer, but the O.P. also wanted fixed point.
Now, how you'd do that in C++, I don't know (C++ not being something I can think in readily). Perhaps try a scaled-integer approach, i.e. use a 32- or 64-bit integer and programmatically allocate the last, say, 6 digits to what's on the right-hand side of the decimal point.
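A sketch of that scaled-integer idea, keeping 6 decimal digits of fraction in a 64-bit integer (illustrative names, no overflow checks):

#include <cstdint>

const int64_t SCALE = 1000000; // 10^6: six digits right of the decimal point

int64_t to_scaled(double value)
{
    // round to the nearest representable scaled value
    return (int64_t)(value * SCALE + (value >= 0 ? 0.5 : -0.5));
}

double from_scaled(int64_t raw)
{
    return (double)raw / SCALE;
}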
There isn't any built-in support in C++ for fixed-point numbers. Your best bet would be to write a wrapper 'FixedInt' class that takes doubles and converts them.
As for a generic method to convert... the int part is easy enough: just grab the integer part of the value and store it in the upper bits. The decimal part would be something along the lines of:
for (int i = 1; i <= precision; i++)
{
    // test whether the fraction still contains the 2^-i component
    if (decimal_part >= 1.0f / (float)(1 << i))
    {
        decimal_part -= 1.0f / (float)(1 << i);
        fixint_value |= (1 << (precision - i));
    }
}
although this is likely to contain bugs still