Integer division in Mixed Integer Linear Programming - linear-programming

What is a solver-friendly way of encoding integer division in MILP programs? Currently I'm using the following encoding (in Gurobi Python), which may not be totally correct, and which I suspect is not optimal.
# res is an integer variable in the solver
# val is an integer variable in the solver
# divVal is just a python variable, a constant for solver
offset = 0.999
divRes = val / divVal
model.addConstr(divRes - offset <= res)
model.addConstr(res <= divRes)
The above encoding essentially says that res should be assigned a value between divRes - offset and divRes; since offset is 0.999, there is only one integer in that range and the solver is forced to assign it to res. Is there a better (faster) way of encoding this?
EDIT: By integer division I mean that the result of the division is an integer. If there is any fractional part after the division, I want to discard it and round the result down before storing it in res. What I essentially want to do is shift a number right by some x bits. In a MILP solver, that boils down to dividing the number by (1 << x), but the division may leave a fractional part which I want to get rid of.

model.addRange(val - divVal*res, 0, divVal - 1, name="Range")
I would prefer to use only the above-mentioned Range constraint. Since val and res are integers and divVal is an integer constant, the slack val - divVal*res is itself an integer, and pinning it to [0, divVal - 1] determines res uniquely as floor(val/divVal). Incorporating tight bounds directly into the model not only improves the numerical behavior, it can also speed up the optimization process (Gurobi uses a branch-and-bound algorithm to get solutions).
https://www.gurobi.com/documentation/9.1/refman/improving_ranges_for_varia.html
Optimality - A small change to the model can also pin the result down via the objective instead of a tight range: keep only the lower bound on the slack and push res upward, so that divVal*res is driven up against val from below. Note that Gurobi does not provide a strict less-than constraint. Moreover, an integrality restriction on a variable is considered satisfied in Gurobi when the variable's value is within IntFeasTol of the nearest integer value; the default IntFeasTol is 1e-5, and it can be reduced down to 1e-9 for sharper results. However, turning the model into a multi-objective one adds extra complexity, so I would not recommend it.
model.addConstr(val - divVal*res >= 0, name="LB")
model.setObjective(res, GRB.MAXIMIZE)
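Putting the pieces together, here is a minimal end-to-end sketch of the range-constraint encoding in Gurobi Python (the variable names, bounds, and the val == 42 test input are illustrative, not from the original post). Because val, res, and the slack are all integral, pinning the slack to [0, divVal - 1] forces res == floor(val / divVal) with no epsilon and no objective term:
import gurobipy as gp
from gurobipy import GRB

divVal = 1 << 3                 # the constant divisor, e.g. a shift by 3 bits
m = gp.Model()
val = m.addVar(vtype=GRB.INTEGER, lb=0, ub=1000, name="val")
res = m.addVar(vtype=GRB.INTEGER, lb=0, ub=1000, name="res")
# 0 <= val - divVal*res <= divVal - 1 pins res to floor(val / divVal)
m.addRange(val - divVal * res, 0, divVal - 1, name="floorDiv")
m.addConstr(val == 42)          # example input
m.optimize()
print(int(res.X))               # 42 // 8 == 5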

Related

Optimal way to do unsigned integer multiplication (with overflow) followed by division by the maximum value

I have the following problem - I have 3 unsigned integers: a, b, c. c is the maximum value that the unsigned integer type I am using can take (so a<=c && b<=c). I know that a*b may overflow, and I know that the number (a * b / c) does not overflow (basically I need the number one would get by casting a, b, c to an unsigned integer type with enough bits and performing the multiplication and division). What is the fastest way to find the number (a * b / c) without having to cast to a wider unsigned integer type (preferably with the lowest error possible)?
I am currently casting to float, and I was wondering whether there is a method that produces results that are better and faster. I know that a*b/c can be expressed as either a/(c/b) or b/(c/a), with the option of expanding the remainder too, but depending on how many times I expand it the error can be big, and I am not exactly sure how that compares, in terms of speed, to casting to float and performing the division in floating point. I have looked at: Avoiding overflow in integer multiplication followed by division, though I was hoping that with the given additional information about a, b, c a better/faster method could be used. The post I linked also doesn't mention anything about speed.
Edit: It's also preferable for the method to work fast on the GPU too, but not necessary.
Edit2: If anybody's wondering why I need this - I am using uints to represent numbers in [0,1], since I need arithmetic operations not to accumulate error.
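For reference, the remainder expansion mentioned above is mathematically exact: writing a = q*c + r gives a*b/c = q*b + r*b/c in integer arithmetic. A quick Python sketch (Python integers cannot overflow, so this only demonstrates the identity, not a fixed-width solution; in a fixed-width C type the r*b term is still the part that can overflow, which is the crux):
def mul_div(a, b, c):
    # exact floor(a*b / c) via the expansion a = q*c + r
    q, r = divmod(a, c)
    return q * b + (r * b) // c

c = 2**32 - 1                       # max value of a hypothetical uint type
a, b = 2**31, 2**31 + 5
assert mul_div(a, b, c) == (a * b) // c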

Stop double division before decimals (low precision, fast division; getting only the 'quotient')

Basically a performance related question:
I want to get only the integer quotient from a double division; for example, for the division 88.3/12.7 = 6.9527559055118110236220472440945, I only want to get '6' as a result.
A possible implementation would of course be floor(x/y), but here the performance-intensive double division is performed first, and afterwards floor throws away most of the 'work' the division did.
So basically I want a division of doubles that 'stops' before calculating all these decimal places and just gives me the correct integer result of the division, without rounding or truncating the initial double arguments. Does anyone know an elegant implementation for this (I searched for the topic but didn't find much)?
Another implementation I can imagine is:
int(x*1000)/int(y*1000)
Where instead of 1000, the needed 'precision' can be used. A very simple implementation would also be to repeatedly subtract y from x until the result is smaller than y, counting the subtractions. But yeah, I was wondering what the best way to do it would be.
Also, simply doing int(x)/int(y) is not an option, since it can easily produce wrong results.
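A quick Python illustration of both points:
import math
x, y = 88.3, 12.7
print(math.floor(x / y))   # 6: the desired integer quotient
print(int(x) // int(y))    # 88 // 12 == 7: truncating the arguments is wrong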
By the way, I know this is probably again one of these 'micro-optimization' questions which deal with a matter that does not really matter on new machines, but well, I still am kinda curious on the subject! :-)
There is no way to stop earlier, and using integer division is potentially slower.
For example, on Skylake:
idiv r/m32: latency 26-27 cycles, reciprocal throughput 6
divsd xmm, xmm: latency 13-14 cycles, reciprocal throughput 4
(source)
So the double division has half the latency and significantly better throughput, and that is before you factor in the extra multiplications and the extra cast the integer version needs.
On older µarchs, 32-bit integer division often has lower listed latency than double division, but the numbers varied more (division used to be more sequential); also, for floats, round divisors are faster, whereas for integer division it is small results that are faster. These differing characteristics can make it swing either way, depending on what you're dividing by what.
As you can see, it's dangerous in this case to optimize without a specific target in mind, but I imagine newer machines are a more likely target than older ones, which means the double division is more or less the best you can do anyway (unless other optimizations apply). Dividing single-precision floats is faster by itself, but incurs a conversion cost; it actually ends up losing (5+10) if you add them up.

Arbitrary precision gamma function

I'm implementing an arbitrary precision arithmetic library in C++ and I'm pretty much stuck when implementing the gamma function.
By using the equivalences gamma(n + 1) = gamma(n) * n and gamma(n) = gamma(n + 1) / n, respectively, I can reduce the problem, for any real value x, to evaluating gamma(r) for a rational number r in the range (1, 2].
However, I don't know how to evaluate gamma(r). For the Lanczos approximation (https://en.wikipedia.org/wiki/Lanczos_approximation), I need precomputed values p, which happen to involve the factorial of a non-integer value (?!) and which I can't calculate dynamically with my current knowledge... Precomputing values for p wouldn't make much sense when implementing an arbitrary-precision library.
Are there any algorithms that compute gamma(r) in a reasonable amount of time with arbitrary precision? Thanks for your help.
Spouge's approximation is similar to the Lanczos approximation, but probably easier to use for arbitrary precision, since you can set the desired error.
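For illustration, here is a minimal Python sketch of Spouge's formula in double precision; an arbitrary-precision implementation would substitute its own number type for float and pick a from Spouge's published error bound instead of the fixed default used here:
import math

def spouge_gamma(z, a=12):
    # Spouge: Gamma(z+1) = (z+a)^(z+1/2) * e^-(z+a) * (c_0 + sum c_k/(z+k)),
    # with c_0 = sqrt(2*pi) and c_k = (-1)^(k-1)/(k-1)! * (a-k)^(k-1/2) * e^(a-k)
    z -= 1.0                        # the series below computes Gamma(z+1)
    s = math.sqrt(2.0 * math.pi)    # c_0
    for k in range(1, a):
        c_k = (-1.0) ** (k - 1) / math.factorial(k - 1) \
              * (a - k) ** (k - 0.5) * math.exp(a - k)
        s += c_k / (z + k)
    return (z + a) ** (z + 0.5) * math.exp(-(z + a)) * s

print(spouge_gamma(5.0))    # ~24.0 == 4!
print(spouge_gamma(1.5))    # ~0.8862... == sqrt(pi)/2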
The Lanczos approximation doesn't seem too bad. What exactly do you find suspect?
The parts of the code that calculate p, C (the Chebyshev polynomials), and (a + 1/2)! can be implemented as stateful objects, so that, for example, p(i) can be calculated from p(i-1), and the Chebyshev coefficients are computed once, with their matrix maintained.

What algorithm should I use for high-performance large integer division?

I am encoding large integers as arrays of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithm if possible (currently Toom-Cook).
I gather there are linear-time algorithms for computing various notions of the multiplicative inverse of my divisor. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time part is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the inverse by my dividend, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode, or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting the "inverse" approach mentioned above. On page 312 of "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth gives "Algorithm R", a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert to C and test, and it is unclear how much memory overhead etc. it will consume until I code it up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Its documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4-digit number by a 2-digit number, and then proceeding via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
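Here is a sketch (mine, not GMP's code) of the Newton iteration for the scaled inverse m = floor(2^k / y), in Python; each step roughly doubles the number of correct bits, and the trailing loops clean up the truncation error:
def reciprocal(y, k):
    # assumes k > y.bit_length(); computes m = floor(2**k / y)
    m = 3 << (k - y.bit_length() - 1)        # initial guess, ~1 correct bit
    for _ in range(k.bit_length() + 3):      # O(log k) quadratic steps
        m = (m * ((2 << k) - y * m)) >> k    # m <- 2m - y*m^2 / 2^k
    while y * m > (1 << k):                  # final +-1 corrections
        m -= 1
    while y * (m + 1) <= (1 << k):
        m += 1
    return m

assert reciprocal(1000, 20) == (1 << 20) // 1000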
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is
Estimate the quotient q of x/y
Compute the corresponding reduction r = x - qy
Optionally adjust the quotient so that the reduction r is in some desired interval
If r is too big, then repeat with r in place of x.
The quotient of x/y will be the sum of all the q's produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers the cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
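A hedged Python sketch of this loop, using a deliberately crude power-of-two quotient estimator so the structure stays visible (the estimator is mine, not from the text; real implementations estimate from leading digits or an inverse):
def divmod_loop(x, y):
    q_total, r = 0, x
    while r >= y:
        q = 1 << (r.bit_length() - y.bit_length())  # step 1: estimate q
        red = r - q * y                             # step 2: reduction
        if red < 0:                                 # step 3: adjust estimate
            q >>= 1
            red = r - q * y
        q_total, r = q_total + q, red               # step 4: repeat on r
    return q_total, r                               # sum of qs, true remainder

assert divmod_loop(10**40 + 7, 12345) == divmod(10**40 + 7, 12345)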
The divide-and-conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results when x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
Estimate the inverse of y with m = floor(2^k / y)
Estimate x/y with q = 2^(i+j-k) * floor(floor(x / 2^i) * m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j), due to how the errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration, or equivalently the number of bits you want to remove from x per iteration.
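To make this concrete, here is a hedged Python sketch of the i = j = 0 case, which is essentially a Barrett-style estimate (the plain // computing m below stands in for a subquadratic reciprocal such as the Newton sketch above):
def floor_div(x, y):
    k = x.bit_length()          # chosen so that x < 2**k
    m = (1 << k) // y           # the scaled inverse, m = floor(2**k / y)
    q = (x * m) >> k            # never overshoots, since m <= 2**k / y
    while (q + 1) * y <= x:     # with k >= bits(x), at most ~2 corrections
        q += 1
    return q, x - q * y

assert floor_div(10**30 + 3, 998877) == divmod(10**30 + 3, 998877)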
I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. In particular, take a look at the approximation divider and the two links there. One is my fixed-point divider, and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen on NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.
I'm not sure the multiplicative inverse is the way to go.
It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire a bigint inverse can be bigger than the standard division itself, but as I am not familiar with it, I could be wrong.
The most common divider I see used in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually use multiplication, which defines their speed.
For small enough numbers, long binary division and 32/64-bit digit-base division are usually fast enough, if not the fastest: they have small overhead. (Let n be the max value processed, not the number of digits!)
Binary division example:
It is O(log32(n)*log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, subtract, add, and bit-shift. Each of those operations can be done in O(log32(n)), and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
template <DWORD N> void uint<N>::div(uint &c,uint &d,uint a,uint b)
{
int j,sh;
c=DWORD(0); d=1;                      // c: quotient, d: current quotient bit value
sh=a.bits()-b.bits();                 // align b's top bit under a's top bit
if (sh<0) sh=0; else { b<<=sh; d<<=sh; }
for (;;)
{
j=geq(a,b);                           // compare a and b
if (j)                                // a>=b: subtract and set this quotient bit
{
c+=d;
sub(a,a,b);
if (j==2) break;                      // a==b: remainder is zero, done
}
if (!sh) break;                       // all bit positions processed
b>>=1; d>>=1; sh--;                   // move to the next lower bit
}
d=a;                                  // d holds the remainder
}
N is the number of 32-bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison, a >= b (done in O(log32(n)) = N word operations).
It returns 0 for a < b, 1 for a > b, 2 for a == b.
sub(c,a,b) is c = a - b.
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shifts).
If you use digits with a big base like 2^32 (ALU words), then you can rewrite the whole thing in a polynomial-like style using the 32-bit built-in ALU operations.
This is usually even faster than binary long division; the idea is to process each DWORD as a single digit, or to recursively halve the arithmetic width until you hit the CPU's native capability.
See division by half-bitwidth arithmetics
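As a small illustration of the digit idea, here is a Python sketch of base-2^32 schoolbook division by a single digit (a toy of mine, not code from the linked answers): each step is one word-sized division, which is exactly what a 64/32-bit ALU divide gives you in C:
B = 1 << 32                          # digit base, one DWORD per digit

def div_by_digit(a, d):
    # a: list of base-B digits, least significant first; d: a single digit
    q, rem = [0] * len(a), 0
    for i in reversed(range(len(a))):
        cur = rem * B + a[i]         # bring down the next digit (fits 64 bits)
        q[i], rem = divmod(cur, d)   # one word-sized division per digit
    return q, rem

def to_int(digits):
    return sum(dig << (32 * i) for i, dig in enumerate(digits))

q, r = div_by_digit([0, 0, 5], 7)    # divide 5 * B^2 by 7
assert to_int(q) == (5 << 64) // 7 and r == (5 << 64) % 7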
On top of all that, while computing with bignums:
If you have optimized basic operations, then the complexity can drop even further, because sub-results get smaller as the iterations proceed (changing the complexity of the basic operations). A nice example of that are NTT-based multiplications.
The overhead can mess things up, though.
Due to this, the runtime sometimes does not track the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit counts actually in use, to get the maximum performance, and optimize what you can.

Proper way to generate a random float given a binary random number generator?

Let's say we have a binary random number generator, int r();, that returns a zero or a one, each with probability 0.5.
I looked at Boost.Random, and they generate, say, 32 bits and do something like this (pseudocode):
x = double(rand_int32());
return min + x / (2^32) * (max - min);
I have some serious doubts about this. A double has 53 bits of mantissa, and 32 bits can never properly generate a fully random mantissa; there are also issues such as rounding errors, etc.
What would be a fast way to create a uniformly distributed float or double in the half-open range [min, max), assuming IEEE754? The emphasis here lies on correctness of distribution, not speed.
To properly define correctness: the correct distribution would be equal to the one we would get by taking an infinitely precise uniformly distributed random number generator and, for each number, rounding to the nearest IEEE 754 representation, if that representation were still within [min, max); otherwise the number would not count for the distribution.
P.S.: I would be interested in correct solutions for open ranges as well.
AFAIK, the correct (and probably also fastest) way is to first create a 64-bit unsigned integer where the 52 fraction bits are random bits and the exponent is 1023, which, if type-punned into an IEEE 754 double, will be a uniformly distributed random value in the range [1.0, 2.0). The last step is then to subtract 1.0, giving a uniformly distributed random double in the range [0.0, 1.0).
In pseudo code:
rndDouble = bitCastUInt64ToDouble(1023 << 52 | rndUInt64 & 0xfffffffffffff) - 1.0
This method is mentioned here:
http://xoroshiro.di.unimi.it
(See "Generating uniform doubles in the unit interval")
EDIT: The recommended method has since changed to:
(x >> 11) * (1. / (UINT64_C(1) << 53))
See above link for details.
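Both recipes are easy to check in Python, using struct for the bit cast (the helper names here are mine):
import random, struct

def rnd_double_v1(u64):
    # exponent 1023 + 52 random fraction bits -> [1.0, 2.0), then shift down
    bits = (1023 << 52) | (u64 & ((1 << 52) - 1))
    return struct.unpack('<d', struct.pack('<Q', bits))[0] - 1.0

def rnd_double_v2(u64):
    # newer recipe: top 53 bits scaled by 2**-53 -> [0.0, 1.0)
    return (u64 >> 11) * (1.0 / (1 << 53))

u = random.getrandbits(64)
print(rnd_double_v1(u), rnd_double_v2(u))   # both uniform in [0, 1)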
Here is a correct approach with no attempt at efficiency.
We start with a bignum class, and then a rational wrapper of said bignums.
We produce a range "sufficiently bigger than" our [min, max) range, so that rounding our smaller_min and bigger_max produces floating-point values outside that range, in our rational type built on the bignums.
Now we subdivide the range into two parts perfectly down the middle (which we can do, as we have a rational bignum system). We pick one of the two parts at random.
If, after rounding, the top and bottom of the picked range would be (A) outside of [min, max) (on the same side, mind you!) you reject and restart from the beginning.
If (B) the top and bottom of your range rounds to the same double (or float if you are returning a float), you are done, and you return this value.
Otherwise (C) you recurse on this new, smaller range (subdivide, pick randomly, test).
There is no guarantee that this procedure halts, because you can either constantly drill down on the "edge" between two rounding doubles, or constantly pick values outside of the [min, max) range. The probability of never halting, however, is zero (assuming a good random number generator and a [min, max) of non-zero size).
This also works for (min, max), or even for picking a number in the rounded, sufficiently fat Cantor set. So long as the measure of the valid range of reals that round to the correct floating-point values is non-zero, and the range has compact support, this procedure can be run; it terminates with probability 100%, but no hard upper bound on its running time can be given.
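A hedged Python sketch of this procedure, using fractions.Fraction as the exact rational type (float() on a Fraction rounds to the nearest double in CPython); the widening of the initial range and the boundary rejection tests are simplified relative to the description above:
import random
from fractions import Fraction

def uniform_double(lo, hi, rbit=lambda: random.getrandbits(1)):
    # assumes lo < hi; rbit() is the binary generator
    pad = (Fraction(hi) - Fraction(lo)) / 2
    while True:                              # restart on rejection
        a, b = Fraction(lo) - pad, Fraction(hi) + pad
        while True:
            mid = (a + b) / 2                # exact midpoint (rational bignum)
            if rbit():                       # pick one half at random
                a = mid
            else:
                b = mid
            fa, fb = float(a), float(b)      # round both ends to doubles
            if fa == fb:                     # case (B): same double -> done
                if lo <= fa < hi:
                    return fa
                break                        # rounds outside: reject (A)
            if fb < lo or fa >= hi:          # whole interval outside: reject (A)
                break
            # case (C): keep subdividing the smaller range

print(uniform_double(0.0, 1.0))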
The problem here is that in IEEE 754 the representable doubles are not equi-distributed. That is, if we have a generator producing real numbers, say in (0,1), and then map them to IEEE 754-representable numbers, the result will not be equi-distributed.
Thus, we have to define "equi-distribution". That said, if each IEEE 754 number is taken as a representative of the probability of lying in the interval defined by the IEEE 754 rounding, then the procedure of first generating equi-distributed "numbers" and then rounding to IEEE 754 will, by definition, generate an "equi-distribution" of IEEE 754 numbers.
Hence, I believe that the above formula comes arbitrarily close to such a distribution if we just choose the accuracy high enough. If we restrict the problem to finding a number in [0,1), this means restricting to the set of denormalized IEEE 754 numbers, which are in one-to-one correspondence with 53-bit integers. Thus it should be fast and correct to generate just the mantissa with a 53-bit binary random number generator.
IEEE 754 arithmetic is always "arithmetic at infinite precision followed by rounding", i.e. the IEEE 754 number representing a*b is the one closest to the exact value of a*b (put differently, you can think of a*b as calculated at infinite precision, then rounded to the closest IEEE 754 number). Hence I believe that min + (max-min) * x, where x is a denormalized number, is a feasible approach.
(Note: As is clear from my comment, I was at first not aware that you were asking about the case with min and max different from 0 and 1. The denormalized numbers have the property that they are evenly spaced; hence you get the equi-distribution by mapping the 53 bits to the mantissa. Next you can use floating-point arithmetic, due to the fact that it is correct up to machine precision. If you use the reverse mapping, you will recover the equi-distribution.)
See this question for another aspect of this problem: Scaling Int uniform random range into Double one
std::uniform_real_distribution.
There's a really good talk by S.T.L. from this year's Going Native conference that explains why you should use the standard distributions whenever possible. In short, hand-rolled code tends to be of laughably poor quality (think std::rand() % 100), or to have more subtle uniformity flaws, such as (std::rand() * 1.0 / RAND_MAX) * 99, which is the example given in the talk and is a special case of the code posted in the question.
EDIT: I took a look at libstdc++’s implementation of std::uniform_real_distribution, and this is what I found:
The implementation produces a number in the range [dist_min, dist_max) by applying a simple linear transformation to a number produced in the range [0, 1). It generates this source number using std::generate_canonical, the implementation of which may be found here (at the end of the file). std::generate_canonical determines the number of times (denoted k) that the range of the distribution, expressed as an integer and denoted here as r*, will fit in the mantissa of the target type. What it then does is essentially to generate one number in [0, r) for each r-sized segment of the mantissa and, using arithmetic, populate each segment accordingly. The resulting value may be expressed as
sum for i = 0 to k-1 of X_i / r^(i+1)
where each X_i is a stochastic variable in [0, r). Each division by the range is equivalent to a shift by the number of bits used to represent it (i.e., log2(r)), and so fills the corresponding mantissa segment. This way, the whole of the precision of the target type is used, and since the range of the result is [0, 1), the exponent remains 0** (modulo bias) and you don't get the uniformity issues you have when you start messing with the exponent.
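A hedged Python sketch of this construction (my own names, not the libstdc++ source): k draws from a base-r generator are stacked into successive mantissa segments, matching the sum above:
import math, random

def generate_canonical(urand, r, bits=53):
    # urand() returns an integer in [0, r); k = ceil(bits / log2(r))
    k = max(1, math.ceil(bits / math.log2(r)))
    x, seg = 0.0, 1.0
    for _ in range(k):
        seg /= r                 # next r-sized segment of the mantissa
        x += urand() * seg       # adds X_i / r^(i+1), as in the sum above
    return x                     # in [0, 1), up to rounding of the sum

print(generate_canonical(lambda: random.getrandbits(32), 1 << 32))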
I would not implicitly trust that this method is cryptographically secure (and I have suspicions about possible off-by-one errors in the calculation of the size of r), but I imagine it is significantly more reliable in terms of uniformity than the Boost implementation you posted, and definitely better than fiddling about with std::rand.
It may be worth noting that the Boost code is in fact a degenerate case of this algorithm where k = 1, meaning that it is equivalent only if the input range requires at least 23 bits to represent its size (IEEE 754 single precision) or at least 52 bits (double precision). This means a minimum range of ~8.4 million or ~4.5e15, respectively. In light of this information, I don't think that, if you're using a binary generator, the Boost implementation is going to cut it.
After a brief look at libc++'s implementation, it looks like they are using the same algorithm, implemented slightly differently.
(*) r is actually the range of the input plus one. This allows using the max value of the urng as valid input.
(**) Strictly speaking, the encoded exponent is not 0, as IEEE 754 encodes an implicit leading 1 before the radix of the significand. Conceptually, however, this is irrelevant to this algorithm.