Optimal way to do unsigned integer multiplication (with overflow) followed by division by the maximum value - c++

I have the following problem: I have 3 unsigned integers a, b, c, where c is the maximum value that the unsigned integer type I am using can take (so a <= c && b <= c). I know that a*b may overflow, and I know that the number (a * b / c) does not overflow (basically I need the number one would get by casting a, b, c to an unsigned integer type with enough bits and performing the multiplication and division there). What is the fastest way to find the number (a * b / c) without having to cast to an unsigned integer type with more bits (preferably with the lowest error possible)?
I am currently casting to float and doing the division there, and I was wondering whether there is a method that is both more accurate and faster. I know that a*b/c can be approximated as either a/(c/b) or b/(c/a), with the option of expanding the remainder too, but depending on how many times I expand it the error can be large, and I am not sure how that compares in speed to casting to float and performing the division in float. I have looked at Avoiding overflow in integer multiplication followed by division, though I was hoping that the additional information about a, b, c would allow a better/faster method. The linked post also doesn't say anything about speed.
Edit: It's also preferable for the method to work fast on the GPU too, but not necessary.
Edit2: If anybody's wondering why I would need this - I am using uint to represent numbers in [0,1], since I need arithmetic operations to not accumulate error.
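For concreteness, here is a minimal sketch of the casting approach I am currently using (shown with double rather than float for less rounding error; the function name is illustrative), assuming the unsigned type is 32-bit so that c == UINT32_MAX. The exact product a*b can need up to 64 significand bits while a double carries only 53, so the low bits are rounded away; with float the error is far larger.

#include <cstdint>
#include <limits>

// Sketch of the casting baseline, assuming a 32-bit unsigned type.
uint32_t mul_div_max(uint32_t a, uint32_t b)
{
    const double c = std::numeric_limits<uint32_t>::max();
    return static_cast<uint32_t>(static_cast<double>(a) * static_cast<double>(b) / c);
}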

Related

How to fix the position of the binary point in an unsigned N-bit integer?

I am working on developing a fixed-point algorithm in C++. I know that, for an N-bit integer, a fixed-point binary number is represented as U(a,b). For example, for an 8-bit integer (i.e., 256 possible values), if we represent it in the form U(6,2), it means that the binary point is to the left of the 2nd bit starting from the right, of the form:
b5 b4 b3 b2 b1 b0 . b(-1) b(-2)
Thus, it has 6 integer bits and 2 fractional bits. In C++, I know there are bit shift operators I can use, but they are basically used for shifting the bits of the input stream; my question is, how do I define a binary fixed-point integer of the form fix<6,2> or U(6,2)? All the major processing operations will be carried out on the fractional part, and I am just looking for a way to do this fix in C++. Any help regarding this would be appreciated. Thanks!
Example: Suppose I have an input discrete signal with 1024 sample points on the x-axis (for now, just think of this input signal as coming from some sensor). Each of these sample points has a particular amplitude. Say the sample at time 2 (x-axis) has an amplitude of 3.67 (y-axis). Now I have a variable "int *input;" that takes sample 2, which in binary is 0000 0100. So basically I want to make this 00000.100 by performing the U(5,3) fix on sample 2 in C++, so that I can perform interpolation operations on fractions of the input sampling period or time.
PS - I don't want to create a separate class or use external libraries for this. I just want to take each 8 bits from my input signal, perform the U(a,b) fix on it, and then do the rest of the operations on the fractional part.
Short answer: left shift.
Long answer:
Fixed point numbers are stored as integers, usually int, which is the fastest integer type for a particular platform.
Normal integers without fractional bits are usually called Q0, Q.0 or QX.0, where X is the total number of bits of the underlying storage type (usually int).
To convert between different Q.X formats, left or right shift. For example, to convert 5 in Q0 to 5 in Q4, left shift it 4 bits, or multiply it by 16.
Usually it's useful to find or write a small fixed-point library that does basic calculations, like a*b>>q and (a<<q)/b, because you will do Q.X=Q.Y*Q.Z and Q.X=Q.Y/Q.Z a lot and you need to convert formats when doing calculations. As you may have observed, using the normal * operator gives you Q.(X+Y)=Q.X*Q.Y, so in order to fit the result into Q.Z format, you need to right shift the result by (X+Y-Z) bits.
Division is similar: you get Q.(X-Y)=Q.X/Q.Y from the standard / operator, and to get the result in Q.Z format you shift the dividend before the division. What's different is that division is an expensive operation, and it's not trivial to write a fast one from scratch. (Both helpers are sketched after the example below.)
Be aware of the double-word support of your platform; it will make your life a lot easier. With double-word arithmetic, the result of a*b can be twice the size of a or b, so you don't lose range by doing a*b>>c. Without double-word support, you have to limit the input range of a and b so that a*b doesn't overflow. This is not obvious when you first start, but soon you will find you need more fractional bits or range to get the job done, and you will finally need to dig into the reference manual of your processor's ISA.
example:
float a = 0.1f;                                  // 0.1
int aQ16 = a * 65536;                            // 0.1 in Q16 format (= 6553)
int bQ16 = 4 << 16;                              // 4 in Q16 format
int cQ16 = (int)((long long)aQ16 * bQ16 >> 16);  // result = 26212 = 0.399963378906250 in Q16,
                                                 // not 26214 (0.4 in Q16), because of truncation error
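Putting the multiply and divide rules together, here is a sketch of the two helpers mentioned above (a*b>>q and (a<<q)/b), assuming 64-bit double-word arithmetic is available; the names are illustrative:

#include <cstdint>

// Q.q * Q.q -> Q.2q in the wide type, then shift back down to Q.q.
int32_t qmul(int32_t a, int32_t b, int q)
{
    return (int32_t)(((int64_t)a * b) >> q);
}

// Pre-shift the dividend so the quotient comes out in Q.q.
int32_t qdiv(int32_t a, int32_t b, int q)
{
    return (int32_t)(((int64_t)a << q) / b);
}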
If this is your question:
Q. Should I define my fixed-binary-point integer as a template, U<int a, int b>(int number), or not, U(int a, int b)?
I think your answer to that is: "Do you want to define operators that take two fixed-binary-point integers? If so make them a template."
The template is just a little extra complexity if you're not defining operators. So I'd leave it out.
But if you are defining operators, you don't want to be able to add U<4, 4> and U<6, 2>. What would you define your result as? The templates will give you a compile time error should you try to do that.
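A minimal sketch of what the templated form could look like (the names are illustrative, not a standard API); because the format is part of the type, mixing formats fails to compile:

#include <cstdint>

template <int I, int F>              // I integer bits, F fractional bits
struct U {
    static_assert(I + F <= 32, "must fit the underlying storage");
    int32_t raw;                     // stored value is the real value times 2^F
};

// Only same-format operands are accepted.
template <int I, int F>
U<I, F> operator+(U<I, F> a, U<I, F> b) { return {a.raw + b.raw}; }

// U<4,4>{} + U<6,2>{} is a compile-time error: no matching operator+.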

arithmetics between large primitive numeric types : can I save the overflow?

disclaimer: I know unsigned integers are primitive numeric types, but they don't technically overflow; I'm using the term "overflow" for all the primitive numeric types in general here.
In C or C++, according to the standard or to a particular implementation, are there primitive numeric types where, given an arithmetic operation that could possibly overflow, I can save the result plus the part that overflows?
Even if this sounds strange, my idea is that the registers on modern CPUs are usually much larger than a 32-bit float or a 64-bit uint64_t, so there is the potential to actually capture the overflow and store it somewhere.
No, the registers are not "usually much larger than a 64 bit uint64_t".
There's an overflow flag, and for a limited number of operations (addition and subtraction), pairing this single additional bit with the result is enough to capture the entire range of outcomes.
But in general, you'd need to cast to a larger type (potentially implemented in software) to handle results that overflow the type of your input.
Any operations that do this sort of thing (for example, some 32-bit processors had a 32x32 multiply instruction producing separate 32-bit high and low result words) will be provided by your compiler as intrinsic functions, or via inline assembly.
look, I found a 64-bit version of that, named Multiply128 and the matching __mul128 intrinsic available in Visual C++
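As a sketch of what such a widening multiply gives you: on GCC and Clang the same capability is exposed through the non-standard __int128 extension (this is an illustration, not the Visual C++ intrinsic named above):

#include <cstdint>

// Full 64x64 -> 128-bit multiply; hi receives the part that would
// otherwise be lost to wraparound.
void mul_full(uint64_t a, uint64_t b, uint64_t &hi, uint64_t &lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    lo = (uint64_t)p;
    hi = (uint64_t)(p >> 64);
}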
See @Ben Voigt's answer about larger registers.
There may really only be an overflow bit that could help you.
Another approach, without resorting to wider integers, is to test overflow yourself:
unsigned a, b, sum;
sum = a + b;
if (sum < a) {
    OverflowDetected(); // mathematical result is `sum` + UINT_MAX + 1
}
Similar approach for int.
The following can likely be simplified; I just don't have it at hand.
[Edit]
My approach below has potential UB. For a better way to detect int overflow, see Simpler method to detect int overflow.
int a, b, sum;
sum = a + b;
// out-of-range only possible when the signs are the same.
if ((a < 0) == (b < 0)) {
    if (a < 0) {
        if (sum > b) UnderflowDetected();
    } else {
        if (sum < b) OverflowDetected();
    }
}
For floating-point types, you can actually 'control' the overflow on the x86 platform. With the functions _control87, _controlfp, and __control87_2, you can get and set the floating-point control word. By default, the run-time libraries mask all floating-point exceptions; you can unmask them in your code, so when an overflow occurs you'll get an exception. However, most code written today assumes that floating-point exceptions are masked, so if you unmask them you may run into problems.
You can use these functions to get the status word.
For floating point types the results are well defined by the hardware, and you're not going to be able to get much control without working around C/C++.
Integer types smaller than int are promoted to int by any arithmetic operation, so you will probably be able to detect overflow before you coerce the result back into the smaller type.
For addition and subtraction you can detect overflow by comparing the result to the inputs. Adding two positive integers will always yield a result larger than either of the inputs, unless there was overflow.
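On GCC and Clang there is also a builtin family that performs the operation and reports overflow in one step, avoiding the UB pitfalls of hand-rolled signed tests; a small sketch:

#include <cstdio>

int main()
{
    int sum;
    // Returns true if the mathematically correct result did not fit in sum.
    if (__builtin_add_overflow(2000000000, 2000000000, &sum))
        puts("overflow detected");
    else
        printf("sum = %d\n", sum);
    return 0;
}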

Conversion Big Integer <-> double in C++

I am writing my own long arithmetic library in C++ for fun and it is already fairly complete; I have even implemented several cryptographic algorithms with it. But one important thing is still missing: I want to convert doubles (and floats/long doubles) into my number format and vice versa. My numbers are represented as a variable-sized array of unsigned long ints plus a sign bit.
I tried to find the answer with google, but the problem is that people rarely ever implement such things themselves, so I only find things about how to use Java BigInteger etc.
Conceptually, it is rather easy: I take the mantissa, shift it by the number of bits dictated by the exponent and set the sign. In the other direction I truncate it so that it fits into the mantissa and set the exponent depending on my log2 function.
But I am having a hard time figuring out the details. I could play around with some bit patterns and cast them to a double, but I didn't find an elegant way to achieve that; or I could "calculate" it by starting with 2, exponentiating, multiplying, etc., but that doesn't seem very efficient.
I would appreciate a solution that doesn't use any library calls, because I am trying to avoid libraries for my project (otherwise I could just have used gmp). Furthermore, on several other occasions I have two solutions, one using inline assembler which is efficient and one that is more platform independent, so either kind of answer is useful to me.
edit: I use uint64_t for my parts, but I would like to be able to change it depending on the machine, but I am willing to do some different implementations with some #ifdefs to achieve that.
I'm going to make a non-portable assumption here: namely, that unsigned long long has more digits of precision than double. (This is true on all modern desktop systems that I know of.)
First, convert the most significant integer(s) into an unsigned long long. Then convert that to a double S. Let M be the number of integers less significant than those used in that first step. Multiply S by (1ull << (sizeof(unsigned)*CHAR_BIT*M)). (If shifting by more than 63 bits, you will have to split the shift into separate shifts and do some arithmetic.) Finally, if the original number was negative, multiply the result by -1.
This rounds a lot, but even with this rounding, due to the above assumption, no digits are lost that wouldn't be lost anyway with the conversion to a double. I think this is a similar process to what Mark Ransom said, but I'm not certain.
For converting from a double to a big integer, first separate the mantissa into a double M and the exponent into an int E, using frexp. Multiply M by UNSIGNED_MAX, and store that result in an unsigned R. If std::numeric_limits<double>::radix is 2 (I don't know if it is or not for x86/x64), you can easily shift R left by E-(sizeof(unsigned)*CHAR_BIT) bits and you're done. Otherwise the result will instead be R*(E**(sizeof(unsigned)*CHAR_BIT)) (where ** means to the power of).
If performance is a concern, you can add an overload to your bignum class for multiplying by std::integral_constant<unsigned, 10>, which simply returns (LHS<<3)+(LHS<<1). You can similarly optimize other constants if you wish.
This blog post might help you: Clarifying and optimizing Integer>>asFloat.
Otherwise, this SO question may give you an idea of the algorithm: Converting from unsigned long long to float with round to nearest even.
You don't say explicitly, but I assume your library is integer only and the unsigned longs are 32 bit and binary (not decimal). The conversion to double is simple, so I'll tackle that first.
Start with a multiplier for the current piece; if the number is positive it will be 1.0, if negative it will be -1.0. For each of the unsigned long ints in your bignum, multiply by the current multiplier and add it to the result, then multiply your multiplier by pow(2.0, 32) (4294967296.0) for 32 bits or pow(2.0, 64) (18446744073709551616.0) for 64 bits.
You can optimize this process by working with only the 2 most significant values. You need to use 2 even if the number of bits in your integer type is larger than the precision of a double, since the number of used bits in the most significant value might only be 1. You can generate the multiplier by taking a power of 2 to the number of skipped bits, e.g. pow(2.0, most_significant_count*sizeof(bit_array[0])*8). You can't use a bit shift as given in another answer because it will overflow after the first value.
To convert from double, you can get the exponent and mantissa separated from each other with the frexp function. The mantissa will come as a floating point value between 0.5 and 1.0 so you'll want to multiply it by pow(2.0, 32) or pow(2.0, 64) to convert it to an integer, then adjust the exponent by -32 or -64 to compensate.
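A sketch of the multiplier loop described above, assuming the big number is stored least-significant-first as 32-bit pieces with a separate sign flag (the layout and names are illustrative, not the asker's actual class):

#include <cstdint>
#include <vector>

double to_double(const std::vector<uint32_t> &pieces, bool negative)
{
    double result = 0.0;
    double multiplier = negative ? -1.0 : 1.0;
    for (uint32_t piece : pieces) {     // least significant piece first
        result += multiplier * piece;
        multiplier *= 4294967296.0;     // pow(2.0, 32): weight of the next piece
    }
    return result;
}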
To go from a big integer to a double, just do it the same way you parse numbers. For example, you parse the number "531" as "1 + (3 * 10) + (5 * 100)". Compute each portion using doubles, starting with the least significant portion.
To go from a double to a big integer, do it the same way but in reverse starting with the most significant portion. So, to convert 531, you first see that it's more than 100 but less than 1000. You find the first digit by dividing by 100. Then you subtract to get the remainder of 31. Then find the next digit by dividing by 10. And so on.
Of course, you won't be using tens (unless you store your big integers as digits). Exactly how you break it apart depends on how your big integer class is constructed. For example, if it uses 64-bit units, then you'll use powers of 2^64 instead of powers of 10.

How do I find the largest integer fully supported by hardware arithmetics?

I am implementing a BigInt class that must support arbitrary-precision operations on integers.
Quote from "The Algorithm Design Manual" by S.Skiena:
What base should I do [editor's note: arbitrary-precision] arithmetic in? - It is perhaps simplest to implement your own high-precision arithmetic package in decimal, and thus represent each integer as a string of base-10 digits. However, it is far more efficient to use a higher base, ideally equal to the square root of the largest integer supported fully by hardware arithmetic.
How do I find the largest integer supported fully by hardware arithmetic? If I understand correctly, since my machine is an x64-based PC, the largest integer supported should be 2^64 (http://en.wikipedia.org/wiki/X86-64 - Architectural features: 64-bit integer capability), so I should use base 2^32; but is there a way in C++ to get this size programmatically, so I can typedef my base_type to it?
You might be searching for std::uintmax_t and std::intmax_t.
static_cast<unsigned>(-1) is the maximum unsigned value, i.e., all bits set to 1. Is that what you are looking for?
You can also use std::numeric_limits<unsigned>::max() or UINT_MAX; all of these yield the same result, the maximum capacity of the unsigned type, i.e., the maximum value that can be stored in an unsigned.
int (and, by extension, unsigned int) is the "natural" size for the architecture. So a type that has half the bits of an int should work reasonably well. Beyond that, you really need to configure for the particular hardware; the type of the storage unit and the type of the calculation unit should be typedefs in a header and their type selected to match the particular processor. Typically you'd make this selection after running some speed tests.
INT_MAX doesn't help here; it tells you the largest value that can be stored in an int, which may or may not be the largest value that the hardware can support directly. Similarly, INTMAX_MAX is no help, either; it tells you the largest value that can be stored as an integral type, but doesn't tell you whether operations on such a value can be done in hardware or require software emulation.
Back in the olden days, the rule of thumb was that operations on ints were done directly in hardware, and operations on longs were done as multiple integer operations, so operations on longs were much slower than operations on ints. That's no longer a good rule of thumb.
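In practice, the typedef the question asks for often comes down to the following sketch, under the assumption (true on typical x86-64) that 64-bit multiplication is the widest operation the hardware performs directly, so 32-bit digits keep every digit product representable:

#include <cstdint>

using base_type    = std::uint32_t;  // digits in base 2^32
using product_type = std::uint64_t;  // holds any base_type * base_type product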
Things are not so black and white. There are MANY issues here, and you may have other things worth considering. I've now written two variable precision tools (in MATLAB, VPI and HPF) and I've chosen different approaches in each. It also matters whether you are writing an integer form or a high precision floating point form.
The difference is, integers can grow without bound in the number of digits. But if you are doing a floating point implementation with a user specified number of digits, you always know the number of digits in the mantissa. This is fixed.
First of all, it is simplest to use a single integer for each decimal digit. This makes many things work nicely: I/O is easy, and adds and subtracts are easy too, though it is a bit inefficient in terms of storage. And if you use integers for each digit, then multiplies are even easy. In MATLAB, for example, conv is pretty fast, though it is still O(n^2). I think gmp uses an FFT multiply, so faster yet.
But assuming you use a basic conv multiply, then you need to worry about overflows for numbers with a huge number of digits. For example, suppose I store decimal digits as 8 bit signed integers. Using conv, followed by carries, I can do a multiply. For example, suppose I have the number 9999.
N = repmat(9,1,4)
N =
     9     9     9     9
conv(N,N)
ans =
    81   162   243   324   243   162    81
Thus even to form the product 9999*9999, I'd need to be careful as the digits will overflow an 8 bit signed integer. If I'm using 16 bit integers to accumulate the convolution products, then a multiply between a pair of 1000 digits integers can cause an overflow.
N = repmat(9,1,1000);
max(conv(N,N))
ans =
       81000
So if you are worried about the possibility of millions of digits, you need to watch out.
One alternative is to use what I call migits, essentially working in a higher base than 10. Thus by using base 1000000 and doubles to store the elements, I can store 6 decimal digits per element. A convolution will still cause overflows for larger numbers though.
N = repmat(999999,1,10000);
log2(max(conv(N,N)))
ans =
53.151
Thus a convolution between two sets of base-1000000 migits that are 10000 migits in length (60000 decimal digits) will overflow the point beyond which a double cannot represent integers exactly.
So again, if you will use numbers with millions of digits, beware. A nice thing about using a higher base of migits with a convolution-based multiply is that, since the conv operation is O(n^2), going from base 10 to base 100 gives you a 4:1 speedup, and going to base 1000 yields a 9:1 speedup in the convolutions.
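The overflow bound in this discussion can be checked mechanically. A sketch in C++ (the function name is illustrative), using the fact that each convolution sum is at most (B-1)^2 * n for n-migit operands in base B:

#include <cstdint>

constexpr bool conv_accumulator_safe(std::uint64_t base, std::uint64_t n)
{
    const std::uint64_t m = base - 1;   // largest migit value
    // Holds iff m * m * n <= UINT64_MAX, evaluated without overflowing.
    return m <= UINT64_MAX / m / n;
}

// Base 10^6 with a million migits (6 million digits) still fits a 64-bit accumulator.
static_assert(conv_accumulator_safe(1000000, 1000000), "safe");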
Finally, the use of a base other than 10 as migits makes it logical to implement guard digits (for floats.) In floating point arithmetic, you should never trust the least significant bits of a computation, so it makes sense to keep a few digits hidden in the shadows. So when I wrote my HPF tool, I gave the user control of how many digits would be carried along. This is not an issue for integers of course.
There are many other issues. I discuss them in the docs carried with those tools.

Avoiding overflow in integer multiplication followed by division

I have two integral variables a and b and a constant s resp. d. I need to calculate the value of (a*b)>>s resp. a*b/d. The problem is that the multiplication may overflow and the final result will not be correct even though a*b/d could fit in the given integral type.
How could that be solved efficiently? The straightforward solution is to expand the variable a or b to a larger integral type, but there may not be a larger integral type. Is there any better way to solve the problem?
If there isn't a larger type, you will either need to find a big-int style library, or deal with it manually, using long multiplication.
For instance, assume a and b are 16-bit. Then you can rewrite them as a = (1<<8)*aH + aL, and b = (1<<8)*bH + bL (where all the individual components are 8-bit numbers). Then you know that the overall result will be:
(a*b) = (1<<16)*aH*bH
      + (1<<8)*aH*bL
      + (1<<8)*aL*bH
      + aL*bL
Each of these 4 components will fit a 16-bit register. You can now perform e.g. right-shifts on each of the individual components, being careful to deal with carries appropriately.
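A sketch of that decomposition in code, for 16-bit a and b split into 8-bit halves: each partial product fits 16 bits, and the carries between components are propagated explicitly:

#include <cstdint>

// Full 16x16 -> 32-bit product returned as two 16-bit halves.
void mul16_wide(uint16_t a, uint16_t b, uint16_t &hi, uint16_t &lo)
{
    uint16_t aH = a >> 8, aL = a & 0xFF;
    uint16_t bH = b >> 8, bL = b & 0xFF;

    uint16_t ll = aL * bL;   // weight 2^0
    uint16_t lh = aL * bH;   // weight 2^8
    uint16_t hl = aH * bL;   // weight 2^8
    uint16_t hh = aH * bH;   // weight 2^16

    // Bits 8..15 of the result plus any carry into the high half.
    uint16_t mid = (ll >> 8) + (lh & 0xFF) + (hl & 0xFF);
    lo = (uint16_t)((mid << 8) | (ll & 0xFF));
    hi = (uint16_t)(hh + (lh >> 8) + (hl >> 8) + (mid >> 8));
}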
I haven't exhaustively tested this, but could you do the division first, then account for the remainder, at the expense of extra operations? Since d is a power of two, all the divisions can be reduced to bitwise operations.
For example, always assume a > b (you want to divide the larger number first). Then a * b / d = ((a / d) * b) + (((a % d) * b) / d)
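In code, the identity looks like the following sketch (assuming a >= b as suggested; note that the middle term (a % d) * b can itself still overflow when b is large relative to d):

#include <cstdint>

// a * b / d without forming the full product a * b.
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t d)
{
    return (a / d) * b + (a % d) * b / d;
}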
If the larger type is just 64 bits then the straightforward solution will most likely result in efficient code. On x86 CPUs, any multiplication of two 32-bit numbers puts the high half of the result in another register. So if your compiler understands that, it can generate efficient code for Int64 result=(Int64)a*(Int64)b.
I had the same problem in C#, and the compiler generated pretty good code. And C++ compilers typically create better code than the .net JIT.
I recommend writing the code with the casts to the larger types and then inspect the generated assembly code to check if it's good.
In certain cases (historically LCG random number generators with selected constants), it is possible to do what you want, for some values of a and d.
This is called Schrage's method; see e.g. there.
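For the modular flavor of the problem, Schrage's method computes (a * x) % m without overflow whenever r = m % a is smaller than q = m / a; a sketch with the classic Lehmer generator constants in mind:

#include <cstdint>

// Valid when m % a < m / a, e.g. a = 16807, m = 2147483647.
int32_t mulmod_schrage(int32_t a, int32_t x, int32_t m)
{
    int32_t q = m / a;
    int32_t r = m % a;
    int32_t t = a * (x % q) - r * (x / q);  // both products stay below m
    return t < 0 ? t + m : t;
}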