Are C/C++ library functions and operators the most optimal ones?

So, at the divide & conquer course we were taught:
Karatsuba multiplication
Fast exponentiation
Now, given two positive integers a and b, is operator* faster than karatsuba(a, b), and is pow(a, b) faster than
int fast_expo(int Base, int exp)
{
    if (exp == 0) {
        return 1;
    }
    if (exp == 1) {
        return Base;
    }
    int half = fast_expo(Base, exp / 2);   // recurse once, not twice, to stay O(log exp)
    if (exp % 2 == 0) {
        return half * half;
    }
    else {
        return Base * half * half;
    }
}
I ask this because I wonder whether they serve only a teaching purpose or whether they are already what the C/C++ implementations use under the hood.

Karatsuba multiplication is a special technique for large integers. It is not comparable to the built-in C++ * operator, which multiplies together operands of basic types like int and double.
To take advantage of Karatsuba, you have to be using multi-precision integers made up of at least around 8 words. (512 bits, if these are 64 bit words). The break-even point at which Karatsuba becomes advantageous is at somewhere between 8 and 24 machine words, according to the accepted answer to this question.
The pow function, which works with a pair of floating-point operands of type double, is not comparable to your fast_expo, which works with operands of type int. They are different functions with different requirements. With pow, you can calculate the cube root of 5: pow(5, 1/3.0). If that's what you would like to calculate, then fast_expo is of no use, no matter how fast.
There is no guarantee that your compiler or C library's pow is absolutely the fastest way for your machine to exponentiate two double-precision floating-point numbers.
Optimization claims in floating-point can be tricky, because it often happens that multiple implementations of the "same" function do not give exactly the same results down to the last bit. You can probably write a fast my_pow that is only good to five decimal digits of precision, and in your application, that approximation might be more than adequate. Have you beat the library? Hardly; your fast function doesn't meet the requirements that would qualify it as a replacement for the pow in the library.

operator* and other standard operators usually map to the primitives provided by the hardware. In case such primitives don't exist (e.g. 64-bit long long on IA-32), the compiler emulates them at a performance penalty (GCC does that in libgcc).
Same for std::pow. It is part of the standard library and isn't mandated to be implemented in a certain way. GNU libc implements pow(a,b) essentially as exp(log(a) * b). exp and log are quite long and written for optimal performance with IEEE 754 floating point in mind.
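For illustration, a minimal sketch of that exp/log identity (not glibc's actual code, which adds extended-precision steps and special-case handling to keep the result correctly rounded):

#include <cmath>

// Illustrative only: the identity a^b = e^(b * ln a), valid for a > 0.
// A real pow() must also handle a <= 0, NaN, infinities, and last-bit accuracy,
// which is where most of the real implementation's work goes.
double naive_pow(double a, double b)
{
    return std::exp(std::log(a) * b);
}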
As for your suggestions:
Karatsuba multiplication for smaller numbers isn't worth it. The multiply machine instruction provided by the processor is already optimized for speed and power usage for the standard data types in use. With bigger numbers, 10-20 times the register capacity, it starts to pay off:
In the GNU MP Bignum Library, there used to be a default KARATSUBA_THRESHOLD as high as 32 for non-modular multiplication (that is, Karatsuba was used when n>=32w with typically w=32); the optimal threshold for modular exponentiation tending to be significantly higher. On modern CPUs, Karatsuba in software tends to be non-beneficial for things like ECDSA over P-256 (n=256, w=32 or w=64), but conceivably useful for much wider modulus as used in RSA.
Here is a list of the multiplication algorithms GNU MP uses and their respective thresholds.
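To make the three-multiplication trade-off concrete, here is a toy sketch of one Karatsuba step applied to a single 64x64 -> 128-bit product (the function name is made up, and it assumes the GCC/Clang unsigned __int128 extension). Real bignum libraries apply the same recursion to arrays of machine words, and only above the thresholds quoted above:

#include <cstdint>

unsigned __int128 karatsuba64(uint64_t a, uint64_t b)
{
    uint64_t a_hi = a >> 32, a_lo = a & 0xFFFFFFFFu;
    uint64_t b_hi = b >> 32, b_lo = b & 0xFFFFFFFFu;

    uint64_t hi = a_hi * b_hi;                                       // multiplication 1
    uint64_t lo = a_lo * b_lo;                                       // multiplication 2
    unsigned __int128 mid =
        (unsigned __int128)(a_hi + a_lo) * (b_hi + b_lo) - hi - lo;  // multiplication 3

    // recombine: a*b = hi*2^64 + (a_hi*b_lo + a_lo*b_hi)*2^32 + lo
    return ((unsigned __int128)hi << 64) + (mid << 32) + lo;
}

The point is that one recursion level trades four half-size multiplications for three, at the cost of extra additions, which is why it only pays off once the operands span many machine words.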
Fast exponentiation doesn't apply to non-integer powers, so it's not really comparable to pow.

A good way to check the speed of an operation is to measure it. If you run through the calculation a billion or so times and see how much time it takes to execute, you have your answer there.
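A minimal sketch of such a measurement loop using std::chrono (a real benchmark would use a framework such as Google Benchmark and guard more carefully against the compiler optimizing the work away):

#include <chrono>
#include <cstdio>

int main()
{
    volatile long long sink = 0;                     // volatile keeps the loop from being deleted
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 1; i <= 1000000000LL; ++i)
        sink = sink + i * i;                         // the operation under test
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.3f s\n", std::chrono::duration<double>(t1 - t0).count());
}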
One thing to note: I'm led to believe that % is fairly expensive. There is a much faster way to check if something is divisible by 2:
bool check_div_two(int number)
{
    return (number & 0x01) == 0;   // even iff the lowest bit is clear
}
This way you've just done a bitwise AND against a mask. I'd assume it's a less expensive op (though for a constant divisor of 2, most compilers already reduce exp % 2 == 0 to exactly this bit test).

The * operator for built-in types will almost certainly be implemented as a single CPU multiplication instruction. So ultimately this is a hardware question, not a language question. Longer code sequences, perhaps function calls, might be generated in cases where there's no direct hardware support.
It's safe to assume that chip manufacturers (Intel, AMD, et al) expend a great deal of effort making arithmetic operations as efficient as possible.

Related

How to operate (fast) on mantissa and exponent part of double or float at c++?

I use C++ for computation of various types of special functions (e.g. the Lambert function, iteration methods for evaluating inversions, etc.). In many cases there is an obviously better approach: to work with the mantissa and exponent directly.
I found many answers on how to extract the mantissa and exponent parts, however all of them were just "academic cases with not very effective speed of computation" that are a little bit useless for me (my motivation for operating on the mantissa and exponent is improved computational speed). Sometimes I need to call some specific function about a billion times (very expensive computing), so every bit of saved computational work counts. And using frexp, which returns the mantissa as a double, is not a good fit.
My questions are (for a C++ compiler with IEEE 754 floating point):
1) How to read specific bit of mantissa of float/double?
2) How to read whole mantissa into integer/byte of float/double?
3) The same questions as 1), 2) for exponent.
4) The same questions as 1), 2), 3) for write.
Keep in mind that my motivation is faster computation by working with the mantissa or exponent directly. I suppose that there must be a very simple solution.
In many cases there is an obviously better approach to work with a mantissa and exponent directly.
I know that feeling all too well from my signal processing work, but the truth is that exponents and mantissas aren't simply usable as separate numbers; IEEE 754 specifies quite a few special cases, biases, offsets, etc.
I suppose that there must be a very simple solution.
Engineering experience tells me: sentences ending with "a simple solution" aren't true, usually.
"academic cases"
That, however, is definitely not true (I'll mention an example at the end).
There's very solid real-world usage of optimizations on IEEE754 floats. However, I find that with later x86 processors' abilities to do SIMD (single instruction, multiple data) and the overall fact that floating point is as fast as most "bit-shifty" operations, I generally suspect you're ill-advised to try to do this on a bit level yourself.
Generally, as IEEE754 is a standard, you'll find documentation for how it's stored on your specific architecture everywhere. If you've looked, you should at least have found the wikipedia article explaining how to do 1) and 2) (it's not as static as you seem to think).
What's more important:
don't try to be smarter than your compiler. You probably won't be, unless you explicitly know how to vectorize multiple identical operations.
Experiment with your specific compiler's math optimizations. As mentioned, they usually don't do much nowadays; CPUs aren't necessarily slower doing float calculations than they are on integers.
I'd rather look at your algorithms and look for potential for optimization there.
Also, while I'm at it, let's pitch VOLK (Vector Optimized Library of Kernels), which is a math library, mainly for signal processing. http://libvolk.org has an overview. Look into the kernels that start with 32f, for example 32f_expfast. You will notice there are different implementations: a generic one and CPU-optimized ones, different for each SIMD instruction set.
You can copy the address of the fp value into an unsigned char* and treat the resulting pointer as the address of an array that overlays the fp value.
In C or C++ if x is an IEEE double then if L is a 64 bit long int, the expression
L = *((long *) &x);
will allow accessing the bits directly.
If s is a byte representing the sign (0 = '+', 1 = '-'), e is an integer representing the unbiased exponent, and f is a long int representing the fractional bits then
s = (byte)(L >> 63);
e = ((int)(L >> 52) & 0x7FF) - 0x3FF;
f = (L & 0x000FFFFFFFFFFFFF);
(If x is a normalized number, i.e., not 0, denormal, inf, nor NaN, then the last expression should have 0x0010000000000000 added to it to allow for the implicit high-order 1 bit in IEEE double format.)
Repacking the sign, exponent and fraction back into a double is similar:
L = ((long)s << 63) + ((long)(e + 0x3FF) << 52) + (f & 0x000FFFFFFFFFFFFF);
x = *((double *) &L);
The above code generates only a few machine instructions with no subroutine calls on 64-bit machines compiled with 64-bit code. With 32-bit code there is sometimes a call to do 64-bit arithmetic, but a good compiler will usually generate in-line code. In either case this approach is very fast.
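As a side note, a sketch of an alternative that gets the same bits without the pointer-cast aliasing concerns: std::memcpy (or std::bit_cast in C++20) is the portable form, and compilers turn it into the same register move.

#include <cstdint>
#include <cstring>

uint64_t double_bits(double x)     // read the raw bit pattern
{
    uint64_t L;
    std::memcpy(&L, &x, sizeof L);
    return L;
}

double bits_double(uint64_t L)     // write a bit pattern back
{
    double x;
    std::memcpy(&x, &L, sizeof x);
    return x;
}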
A similar approach works for C# using L = BitConverter.DoubleToInt64Bits(x); and x = BitConverter.Int64BitsToDouble(L); or exactly as above if unsafe code is allowed.

Integer division, or float multiplication?

If one has to calculate a fraction of a given int value, say:
int j = 78;
int i = 5* j / 4;
Is this faster than doing:
int i = 1.25*j; // ?
If it is, is there a conversion factor one could use to decide which to use, as in how many int divisions can be done in the same time as one float multiplication?
Edit: I think the comments make it clear that the floating point math will be slower, but the question is, by how much? If I need to replace each float multiplication by N int divisions, for what N will this not be worth it anymore?
You've said all the values are dynamic, which makes a difference. For the specific values 5 * j / 4, the integer operations are going to be blindingly fast, because pretty much the worst case is that the compiler optimises them to two shifts and one addition, plus some messing around to cope with the possibility that j is negative. If the CPU can do better (single-cycle integer multiplication or whatever) then the compiler typically knows about it. The limits of compilers' abilities to optimize this kind of thing basically come when you're compiling for a wide family of CPUs (generating lowest-common-denominator ARM code, for example), where the compiler doesn't really know much about the hardware and therefore can't always make good choices.
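For illustration, a sketch of the kind of shift-and-add sequence an optimizer typically emits for 5*j/4 with a signed int (the function name is made up; it assumes the usual arithmetic right shift for negative values):

// Roughly the integer sequence a compiler emits for 5*j/4 with signed j:
int five_quarters(int j)
{
    int t = 5 * j;              // usually a single lea / shift+add, not a mul
    t += (t >> 31) & 3;         // add 3 only when t is negative (arithmetic shift)
    return t >> 2;              // now the shift truncates toward zero, like t / 4
}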
I suppose that if a and b are fixed for a while (but not known at compile time), then it's possible that computing k = double(a) / b once and then int(k * x) for many different values of x, might be faster than computing a * x / b for many different values of x. I wouldn't count on it.
If all the values vary each time, then it seems unlikely that the floating-point division to compute the 1.25, followed by floating-point multiplication, is going to be any faster than the integer multiplication followed by integer division. But you never know, test it.
It's not really possible to give simple relative timings for this on modern processors, it really depends a lot on the surrounding code. The main costs in your code often aren't the "actual" ops: it's "invisible" stuff like instruction pipelines stalling on dependencies, or spilling registers to stack, or function call overhead. Whether or not the function that does this work can be inlined might easily make more difference than how the function actually does it. As far as definitive statements of performance are concerned you can basically test real code or shut up. But the chances are that if your values start as integers, doing integer ops on them is going to be faster than converting to double and doing a similar number of double ops.
It is impossible to answer this question out of context. Additionally 5*j/4 does not generally produce the same result as (int) (1.25*j), due to properties of integer and floating-point arithmetic, including rounding and overflow.
If your program is doing mostly integer operations, then the conversion of j to floating point, multiplication by 1.25, and conversion back to integer might be free because it uses floating-point units that are not otherwise engaged.
Alternatively, on some processors, the operating system might mark the floating-point state to be invalid, so that the first time a process uses it, there is an exception, the operating system saves the floating-point registers (which contain values from another process), restores or initializes the registers for your process, and returns from the exception. This would take a great deal of time, relative to normal instruction execution.
The answer also depends on characteristics of the specific processor model the program is executing on, as well as the operating system, how the compiler translates the source into assembly, and possibly even what other processes on the system are doing.
Also, the performance difference between 5*j/4 and (int) (1.25*j) is most often too small to be noticeable in a program unless it or operations like it are repeated a great many times. (And, if they are, there may be huge benefits to vectorizing the code, that is, using the Single Instruction Multiple Data [SIMD] features of many modern processors to perform several operations at once.)
In your case, 5*j/4 would be much faster than 1.25*j, because division by a power of 2 can be handled with a right shift, and 5*j can be done by a single instruction on many architectures, such as LEA on x86 or ADD with shift on ARM. Most others would require at most 2 instructions, like j + (j << 2), but either way it's still probably faster than a floating-point multiplication. Moreover, by doing int i = 1.25*j you need 2 conversions, from int to double and back, and 2 cross-domain data movements, which are generally very costly.
In other cases when the fraction is not representable in binary floating-point (like 3*j/10) then using int multiply/divide would be more correct (because 0.3 isn't exactly 0.3 in floating-point), and most probably faster (because the compiler can optimize out division by a constant by converting it to a multiplication)
In cases where i and j are of a floating-point type, multiplying by another floating-point value might be faster, because moving values between the float and int domains takes time, and conversion between int and float also takes time, as I said above.
An important difference is that 5*j/4 will overflow if j is too large, but 1.25*j doesn't.
That said, there's no general answer for the questions "which is faster" and "how much faster", as it depends on a specific architecture and in a specific context. You must measure on your system and decide. But if an expression is done repeatedly to a lot of values then it's time to move to SIMD
See also
Why is int * float faster than int / int?
Should I use multiplication or division?
Floating point division vs floating point multiplication

How raise to power works? Is it worth to use pow(x, 2)?

Is it more efficient to do a multiplication than raise to the power 2 in C++?
I am trying to do final detailed optimizations. Will the compiler treat
x*x the same as pow(x,2)? If I remember correctly, multiplication was
better for some reason, but maybe it does not matter in C++11.
Thanks
If you're comparing multiplication with the pow() standard library function then yes, multiplication is definitely faster.
In general, you should not worry about pico-optimizations like that unless you have evidence that there is a hot-spot (i.e. unless you've profiled your code under realistic scenarios and have identified a particular chunk of code). Also keep in mind that your clever tricks may actually cause performance regressions in new processors where your assumptions will no longer hold.
Algorithmic changes are where you will get the most bang for your computing buck. Focus on that.
Tinkering with multiplications and doing clever bit-hackery... eh, not so much bang there*, because the current generation of optimizing compilers is really quite excellent at its job. That's not to say they can't be beat. They can, but not easily and probably only by a few people like Agner Fog.
* there are, of course, exceptions.
When it comes to performance, always make measurements to back up your assumptions. Never trust theory unless you have a benchmark that proves that theory right.
Also, keep in mind that x ^ 2 does not yield the square of x in C++:
#include <iostream>
int main()
{
int x = 4;
std::cout << (x ^ 2); // Prints 6
}
The implementation of pow() typically involves logarithms, multiplication and exponentiation, so it will DEFINITELY take longer than a simple multiplication. Most modern high-end processors can do multiplication in a couple of clock cycles for integer values, and a dozen or so cycles for a floating-point multiply. The logarithm and the exponential are each either done as a complex (microcoded) instruction that takes a few dozen or more cycles, or as a series of multiplications and additions (typically with alternating positive and negative terms, but not certainly).
On lower-range processors (e.g. ARM or older x86 processors), the results are even worse: hundreds of cycles for one floating-point operation, or, on some processors, floating-point calculations are themselves done as a number of integer operations that perform the same steps as the float instructions on more advanced processors, so the time taken for pow() could be thousands of cycles, compared with a dozen or so for a multiplication.
Whichever choice is used, the whole calculation will be significantly longer than a simple multiplication.
The pow() function is useful when the exponent is either large, or not an integer. Even for relatively large exponents, you can do the calculation by squaring or cubing multiple times, and it will be faster than pow().
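For integer exponents, "squaring multiple times" is exponentiation by squaring; a minimal iterative sketch (the name is made up, overflow is left to the caller):

#include <cstdint>

// Exponentiation by squaring: O(log exp) multiplications instead of exp - 1.
int64_t ipow(int64_t base, unsigned exp)
{
    int64_t result = 1;
    while (exp > 0) {
        if (exp & 1)          // low bit set: multiply this power in
            result *= base;
        base *= base;         // square for the next bit
        exp >>= 1;
    }
    return result;
}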
Of course, sometimes the compiler may be able to figure out what you want to do, and do it as a sequence of multiplications as an optimization. But I wouldn't rely on that.
Finally, as ALWAYS, for performance questions: if it's really important to your code, then measure it - your compiler may be smarter than you think. If performance isn't important, then write the calculation in the way that makes the code most readable.
pow is a library function, not an operator. Unless the compiler is able to optimize out the call (which it may legitimately do by taking advantage of its knowledge of the behavior of the standard library functions), calling pow() will impose the overhead of a function call and of all the extra stuff the pow() function has to do.
The second argument to pow() doesn't have to be an integer; for example pow(x, 1.0/3.0) will give you an approximation of the cube root of x. That's going to require some fairly sophisticated computations. It might fall back to repeated multiplication if the second argument is a small integral value, but then it has to check for that at run time.
If the number you want to square is an integer, pow will involve converting it to double, then converting the result back to an integer type, which is relatively expensive and could cause subtle rounding errors.
Using x * x is very likely to be faster and more reliable than pow(x, 2), and it's simpler. (In most contexts, simplicity and reliability are more important considerations than speed.)
C/C++ does not have a native "power" operator. ^ is the bitwise exclusive or (xor). That said, the pow function is probably what you are looking for.
Actually, for squaring an integer number, x*x is the most immediate way, and some compilers might optimize it to a machine operation if available.
You should read the following link
Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
pow(x,2) will most likely be converted to x*x. However, higher powers such as pow(x,4) may not be done as optimally as possible. For example, pow(x,4) could be done in 3 multiplications x*x*x*x or in two (x*x)*(x*x), depending on how strict you require the floating-point behaviour to be (by default I think it will use 3 multiplications).
It would be interesting to see what for example pow(x*x,2) produces with and without -ffast-math.
You should look into boost.math's pow function template. It takes the exponent as a template parameter and automatically calculates, for example, pow<4>(x) as (x*x)*(x*x).
http://www.boost.org/doc/libs/1_53_0/libs/math/doc/sf_and_dist/html/math_toolkit/special/powers/ct_pow.html
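A short usage sketch, assuming the header path from the linked documentation:

#include <boost/math/special_functions/pow.hpp>

double fourth_power(double x)
{
    // compile-time exponent; expanded to a multiplication chain, here (x*x)*(x*x)
    return boost::math::pow<4>(x);
}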

Floats vs rationals in arbitrary precision fractional arithmetic (C/C++)

There are two ways of implementing an AP fractional number: one is to emulate the storage and behavior of the double data type, only with more bytes; the other is to use an existing integer APA implementation to represent a fractional number as a rational, i.e. as a pair of integers, numerator and denominator. Which of the two ways is more likely to deliver efficient arithmetic in terms of performance? (Memory usage is really of minor concern.)
I'm aware of the existing C/C++ libraries, some of which offer fractional APA with "floats" and others with rationals (none of them features fixed-point APA, however), and of course I could benchmark a library that relies on a "float" implementation against one that makes use of a rational implementation, but the results would largely depend on implementation details of the particular libraries I would have to choose more or less at random from the nearly ten available ones. So it's the more theoretical pros and cons of the two approaches that I'm interested in (or of three, if we take fixed-point APA into consideration).
The question is what you mean by the "arbitrary precision" you mention in the title. Does it mean "arbitrary, but pre-determined at compile-time and fixed at run-time"? Or does it mean "infinite, i.e. extendable at run-time to represent any rational number"?
In the former case (precision customizable at compile-time, but fixed afterwards) I'd say that one of the most efficient solutions would actually be fixed-point arithmetic (i.e. none of the two you mentioned).
Firstly, fixed-point arithmetic does not require any dedicated library for basic arithmetic operations. It is just a concept overlaid over integer arithmetic. This means that if you really need a lot of digits after the dot, you can take any big-integer library, multiply all your data, say, by 2^64 and you basically immediately get fixed-point arithmetic with 64 binary digits after the dot (at least as long as arithmetic operations are concerned, with some extra adjustments for multiplication and division). This is typically significantly more efficient than floating-point or rational representations.
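As a small illustration of the idea on ordinary 64-bit integers (a Q32.32 toy with made-up names, using the GCC/Clang __int128 extension for the intermediate product; an arbitrary-precision version would do the same with a big-integer type and a larger scale factor):

#include <cstdint>

using fix64 = int64_t;                 // Q32.32: 32 integer bits, 32 fractional bits
constexpr int FRAC_BITS = 32;

fix64  fix_from_double(double d) { return (fix64)(d * (1LL << FRAC_BITS)); }
double fix_to_double(fix64 f)    { return (double)f / (1LL << FRAC_BITS); }

fix64 fix_add(fix64 a, fix64 b)  { return a + b; }    // plain integer addition

fix64 fix_mul(fix64 a, fix64 b)                        // multiply, then rescale
{
    return (fix64)(((__int128)a * b) >> FRAC_BITS);    // arithmetic shift assumed
}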
Note also that in many practical applications multiplication operations are often accompanied by division operations (as in x = y * a / b) that "compensate" for each other, meaning that often it is unnecessary to perform any adjustments for such multiplications and divisions. This also contributes to efficiency of fixed-point arithmetic.
Secondly, fixed-point arithmetic provides uniform precision across the entire range. This is not true for either floating-point or rational representations, which in some applications could be a significant drawback for the latter two approaches (or a benefit, depending on what you need).
So, again, why are you considering floating-point and rational representations only? Is there something that prevents you from considering fixed-point representation?
Since no one else seemed to mention this, rationals and floats represent different sets of numbers. The value 1/3 can be represented precisely with a rational, but not with a float. Even an arbitrary precision float would take infinitely many mantissa bits to represent a repeating fraction like 1/3. This is because a float is effectively like a rational where the denominator is constrained to be a power of 2. An arbitrary precision rational can represent everything that an arbitrary precision float can and more, because the denominator can be any integer instead of just powers of 2. (That is, unless I've horribly misunderstood how arbitrary precision floats are implemented.)
This is in response to your prompt for theoretical pros and cons.
I know you didn't ask about memory usage, but here's a theoretical comparison in case anyone else is interested. Rationals, as mentioned above, specialize in numbers that can be represented simply in fractional notation, like 1/3 or 492113/203233, and floats specialize in numbers that are simple to represent in scientific notation with powers of 2, like 5*2^45 or 91537*2^203233. The amount of ascii typing needed to represent the numbers in their respective human-readable form is proportional to their memory usage.
Please correct me in the comments if I've gotten any of this wrong.
Either way, you'll need multiplication of arbitrary-size integers. This will be the dominant factor in your performance, since its complexity is worse than O(n*log(n)). Things like aligning operands, and adding or subtracting large integers, are O(n), so we'll neglect those.
For simple addition and subtraction, you need no multiplications for floats* and 3 multiplications for rationals. Floats win hands down.
For multiplication, you need one multiplication for floats and 2 multiplications for rational numbers. Floats have the edge.
Division is a little bit more complex, and rationals might win out here, but it's by no means a certainty. I'd say it's a draw.
So overall, IMHO, the fact that addition is at least O(n*log(n)) for rationals and O(n) for floats clearly gives the win to a floating-point representation.
*It is possible that you might need one multiplication to perform addition if your exponent base and your digit base are different. Otherwise, if you use a power of 2 as your base, then aligning the operands takes a bit shift. If you don't use a power of two, then you may also have to do a multiplication by a single digit, which is also an O(n) operation.
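For reference, the operation counts above follow directly from the textbook formulas: a rational addition a/b + c/d = (a*d + c*b)/(b*d) needs three big-integer multiplications (plus a gcd reduction if the result is kept in lowest terms), whereas a floating-point addition only needs to align the exponents, which is a shift, before adding the mantissas.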
You are effectively asking the question: "I need to participate in a race with my chosen animal. Should I choose a turtle or a snail?"
The first proposal, "emulating double", sounds like staggered precision: using an array of doubles whose sum is the represented number. There is a paper by Douglas M. Priest, "Algorithms for Arbitrary Precision Floating Point Arithmetic", which describes how to implement this arithmetic. I implemented this and my experience is very bad: the overhead necessary to make this work drops performance by a factor of 100-1000!
The other method of using rationals has severe disadvantages, too: you need to implement gcd and lcm, and unfortunately every prime in your numerator or denominator has a good chance of blowing up your numbers and killing your performance.
So from my experience they are the worst choices one can make for performance.
I recommend the use of the MPFR library which is one of the fastest AP packages in C and C++.
Rational numbers don't give arbitrary precision, but rather the exact answer. They are, however, more expensive in terms of storage and certain operations with them become costly and some operations are not allowed at all, e.g. taking square roots, since they do not necessarily yield a rational answer.
Personally, I think in your case AP floats would be more appropriate.

Integer division algorithm

I was thinking about an algorithm in division of large numbers: dividing with remainder bigint C by bigint D, where we know the representation of C in base b, and D is of form b^k-1. It's probably the easiest to show it on an example. Let's try dividing C=21979182173 by D=999.
We write the number as sets of three digits: 21 979 182 173
We take sums (modulo 999) of consecutive sets, starting from the left: 21 001 183 356
We add 1 to those sets preceding the ones where we "went over 999": 22 001 183 356
Indeed, 21979182173/999=22001183 and remainder 356.
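A minimal C++ sketch of exactly this procedure, assuming the dividend is stored as base-1000 limbs, most significant first (the function and variable names are made up; an empty quotient vector means the quotient is zero):

#include <cstdint>
#include <vector>

// Divide a non-empty base-1000 number (most significant limb first) by 999.
std::vector<uint32_t> div_by_999(const std::vector<uint32_t>& limbs, uint32_t& rem)
{
    std::vector<uint32_t> q;            // quotient limbs, built left to right
    uint32_t running = 0;               // running prefix sum, kept below 999

    auto carry_into_quotient = [&q]() { // "add 1 to the preceding sets"
        uint32_t carry = 1;
        for (size_t j = q.size(); carry && j > 0; ) {
            --j;
            q[j] += carry;
            carry = q[j] / 1000;
            q[j] %= 1000;
        }
        if (carry) q.insert(q.begin(), carry);
    };

    for (size_t i = 0; i + 1 < limbs.size(); ++i) {
        running += limbs[i];            // sum of consecutive sets
        if (running >= 999) {           // "went over 999"
            running -= 999;
            carry_into_quotient();
        }
        q.push_back(running);
    }

    rem = running + limbs.back();       // the last prefix sum is the remainder
    if (rem >= 999) {
        rem -= 999;
        carry_into_quotient();
    }
    return q;
}

Feeding it {21, 979, 182, 173} reproduces the worked example: quotient limbs {22, 1, 183} (i.e. 22001183) and remainder 356.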
I've calculated the complexity and, if I'm not mistaken, the algorithm should work in O(n), n being the number of digits of C in base b representation. I've also done a very crude and unoptimized version of the algorithm (only for b=10) in C++, tested it against GMP's general integer division algorithm and it really does seem to fare better than GMP. I couldn't find anything like this implemented anywhere I looked, so I had to resort to testing it against general division.
I found several articles which discuss what seem to be quite similar matters, but none of them concentrate on actual implementations, especially in bases different than 2. I suppose that's because of the way numbers are internally stored, although the mentioned algorithm seems useful for, say, b=10, even taking that into account. I also tried contacting some other people, but, again, to no avail.
Thus, my question would be: is there an article or a book or something where the aforementioned algorithm is described, possibly discussing the implementations? If not, would it make sense for me to try and implement and test such an algorithm in, say, C/C++ or is this algorithm somehow inherently bad?
Also, I'm not a programmer and while I'm reasonably OK at programming, I admittedly don't have much knowledge of computer "internals". Thus, pardon my ignorance - it's highly possible there are one or more very stupid things in this post. Sorry once again.
Thanks a lot!
Further clarification of points raised in the comments/answers:
Thanks, everyone - as I didn't want to comment on all the great answers and advice with the same thing, I'd just like to address one point a lot of you touched on.
I am fully aware that working in bases 2^n is, generally speaking, clearly the most efficient way of doing things. Pretty much all bigint libraries use 2^32 or whatever. However, what if (and, I emphasize, it would be useful only for this particular algorithm!) we implement bigints as an array of digits in base b? Of course, we require b here to be "reasonable": b=10, the most natural case, seems reasonable enough. I know it's more or less inefficient both considering memory and time, taking into account how numbers are internally stored, but I have been able to, if my (basic and possibly somehow flawed) tests are correct, produce results faster than GMP's general division, which would give sense to implementing such an algorithm.
Ninefingers notes that in that case I'd have to use an expensive modulo operation. I hope not: I can see if old+new crossed, say, 999, just by looking at the number of digits of old+new+1. If it has 4 digits, we're done. Even more, since old<999 and new<=999, we know that if old+new+1 has 4 digits (it can't have more), then (old+new)%999 equals deleting the leftmost digit of old+new+1, which I presume we can do cheaply.
Of course, I'm not disputing the obvious limitations of this algorithm, nor do I claim it can't be improved - it can only divide by a certain class of numbers and we have to know the representation of the dividend in base b a priori. However, for b=10, for instance, the latter seems natural.
Now, say we have implemented bignums as I outlined above. Say C=(a_1a_2...a_n) in base b and D=b^k-1. The algorithm (which could be probably much more optimized) would go like this. I hope there aren't many typos.
if k>n, we're obviously done
add a zero (i.e. a_0=0) at the beginning of C (just in case we try to divide, say, 9999 by 99)
l=n%k (mod for "regular" integers - shouldn't be too expensive)
old=(a_0...a_l) (the first set of digits, possibly with fewer than k digits)
for (i=l+1; i < n; i=i+k) (we will have floor(n/k) or so iterations)
    new=(a_i...a_(i+k-1))
    new=new+old (this is bigint addition, thus O(k))
    aux=new+1 (again, bigint addition - O(k) - which I'm not happy about)
    if aux has more than k digits
        delete the first digit of aux
        old=old+1 (bigint addition once again)
        fill old with zeroes at the beginning so it has as many digits as it should
        (a_(i-k)...a_(i-1))=old (if i=l+1, (a_0...a_l)=old)
        new=aux
    fill new with zeroes at the beginning so it has as many digits as it should
    (a_i...a_(i+k-1))=new
quot=(a_0...a_(n-k+1))
rem=new
There, thanks for discussing this with me - as I said, this does seem to me to be an interesting "special case" algorithm to try to implement, test and discuss, if nobody sees any fatal flaws in it. If it's something not widely discussed so far, even better. Please, let me know what you think. Sorry about the long post.
Also, just a few more personal comments:
@Ninefingers: I actually have some (very basic!) knowledge of how GMP works, what it does and of general bigint division algorithms, so I was able to understand much of your argument. I'm also aware GMP is highly optimized and in a way customizes itself for different platforms, so I'm certainly not trying to "beat it" in general - that seems about as fruitful as attacking a tank with a pointed stick. However, that's not the idea of this algorithm - it works in very special cases (which GMP does not appear to cover). On an unrelated note, are you sure general divisions are done in O(n)? The most I've seen done is M(n). (And that can, if I understand correctly, in practice (Schönhage-Strassen etc.) not reach O(n). Fürer's algorithm, which still doesn't reach O(n), is, if I'm correct, almost purely theoretical.)
@Avi Berger: This doesn't actually seem to be exactly the same as "casting out nines", although the idea is similar. However, the aforementioned algorithm should work all the time, if I'm not mistaken.
Your algorithm is a variation of a base 10 algorithm known as "casting out nines". Your example is using base 1000 and "casting out" 999's (one less than the base). This used to be taught in elementary school as a way to do a quick check on hand calculations. I had a high school math teacher who was horrified to learn that it wasn't being taught anymore and filled us in on it.
Casting out 999's in base 1000 won't work as a general division algorithm. It will generate values that are congruent modulo 999 to the actual quotient and remainder - not the actual values. Your algorithm is a bit different and I haven't checked if it works, but it is based on effectively using base 1000 and the divisor being 1 less than the base. If you wanted to try it for dividing by 47, you would have to convert to a base 48 number system first.
Google "casting out nines" for more information.
Edit: I originally read your post a bit too quickly, and you do know of this as a working algorithm. As @Ninefingers and @Karl Bielefeldt have stated more clearly than me in their comments, what you aren't including in your performance estimate is the conversion into a base appropriate for the particular divisor at hand.
I feel the need to add to this based on my comment. This isn't an answer, but an explanation as to the background.
A bignum library uses what are called limbs (search for mp_limb_t in the GMP source), which are usually fixed-size integer fields.
When you do something like addition, one way (albeit inefficient) to approach it is to do this:
doublelimb r = limb_a + limb_b + carryfrompreviousiteration
This double-sized limb catches the overflow of limb_a + limb_b in the case that the sum is bigger than the limb size. So if the total is bigger than 2^32 if we're using uint32_t as our limb size, the overflow can be caught.
Why do we need this? Well, what you typically do is loop through all the limbs - you've done this yourself in dividing your integer up and going through each one - but we do it LSL first (so the smallest limb first) just as you'd do arithmetic by hand.
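A small sketch of that limb loop in C++ (names are made up), where the high half of the double-width sum is the carry into the next limb:

#include <cstdint>
#include <cstddef>

// r = a + b over n limbs, least significant limb first.
void add_limbs(uint32_t* r, const uint32_t* a, const uint32_t* b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;   // double-width sum
        r[i]  = (uint32_t)t;                          // low 32 bits: the result limb
        carry = t >> 32;                              // high bits: carry out (0 or 1)
    }
    // a full routine would also return or store the final carry
}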
This might seem inefficient, but this is just the C way of doing things. To really break out the big guns, x86 has adc as an instruction - add with carry. What this does is an arithmetic add on your fields, and it sets the carry bit if the arithmetic overflows the size of the register. The next time you do add or adc, the processor factors in the carry bit too. In subtraction it's called the borrow flag.
This also applies to shift operations. As such, this feature of the processor is crucial to what makes bignums fast. So the fact is, there's electronic circuitry in the chip for doing this stuff - doing it in software is always going to be slower.
Without going into too much detail, operations are built up from this ability to add, shift, subtract etc. They're crucial. Oh and you use the full width of your processor's register per limb if you're doing it right.
Second point - conversion between bases. You cannot take a value in the middle of a number and change its base, because you can't account for the overflow from the digit beneath it in your original base, and that number can't account for the overflow from the digit beneath it... and so on. In short, every time you want to change base, you need to convert the entire bignum from the original base to your new base and back again. So you have to walk the bignum (all the limbs) three times at least. Or, alternatively, detect overflows expensively in all other operations... remember, now you need to do modulo operations to work out if you overflowed, whereas before the processor was doing it for us.
I should also like to add that whilst what you've got is probably quick for this case, bear in mind that as a bignum library gmp does a fair bit of work for you, like memory management. If you're using mpz_ you're using an abstraction above what I've described here, for starters. Finally, gmp uses hand optimised assembly with unrolled loops for just about every platform you've ever heard of, plus more. There's a very good reason it ships with Mathematica, Maple et al.
Now, just for reference, some reading material.
Modern Computer Arithmetic is a Knuth-like work for arbitrary precision libraries.
Donald Knuth, Seminumerical Algorithms (The Art of Computer Programming Volume II).
William Hart's blog on implementing algorithms for bsdnt, in which he discusses various division algorithms. If you're interested in bignum libraries, this is an excellent resource. I considered myself a good programmer until I started following this sort of stuff...
To sum it up for you: division assembly instructions suck, so people generally compute inverses and multiply instead, as you do when defining division in modular arithmetic. The various techniques that exist (see MCA) are mostly O(n).
Edit: Ok, not all of the techniques are O(n). Most of the techniques called div1 (dividing by something not bigger than a limb) are O(n). When you go bigger you end up with O(n^2) complexity; this is hard to avoid.
Now, could you implement bigints as an array of digits? Well yes, of course you could. However, consider the idea below, just for addition:
/* you wouldn't do this just before the add; it's just to
   show you the declarations. */
uint32_t* x = malloc(num_limbs*sizeof(uint32_t));
uint32_t* y = malloc(num_limbs*sizeof(uint32_t));
uint32_t* a = malloc(num_limbs*sizeof(uint32_t));
uint32_t m = 0;                      /* carry between limbs */

for ( size_t i = 0; i < num_limbs; i++ )
{
    uint64_t t = (uint64_t)x[i] + y[i] + m;
    /* now we need to work out if that overflowed at all */
    a[i] = t % somebase;             /* expensive modulo to get the digit */
    m    = t / somebase;             /* expensive division to get the carry */
}
/* frees somewhere */
That's a rough sketch of what you're looking at for addition via your scheme. So you have to run the conversion between bases: you need a conversion to your representation for the base, then back when you're done, because this form is just really slow everywhere else. We're not talking about the difference between O(n) and O(n^2) here, but we are talking about an expensive division instruction per limb, or an expensive conversion every time you want to divide. See this.
Next up, how do you expand your division for general case division? By that, I mean when you want to divide those two numbers x and y from the above code. You can't, is the answer, without resorting to bignum-based facilities, which are expensive. See Knuth. Taking modulo a number greater than your size doesn't work.
Let me explain. Try 21979182173 mod 1099. Let's assume here for simplicity's sake that the biggest size field we can have is three digits. This is a contrived example, but the biggest field size I know of uses 128 bits using gcc extensions. Anyway, the point is, you:
21 979 182 173
Split your number into limbs. Then you take modulo and sum:
21 1000 1182 1355
It doesn't work. This is where Avi is correct, because this is a form of casting out nines, or an adaption thereof, but it doesn't work here because our fields have overflowed for a start - you're using the modulo to ensure each field stays within its limb/field size.
So what's the solution? Split your number up into a series of appropriately sized bignums? And start using bignum functions to calculate everything you need to? This is going to be much slower than any existing way of manipulating the fields directly.
Now perhaps you're only proposing this case for dividing by a limb, not a bignum, in which case it can work, but Hensel division and precomputed inverses etc. do so without the conversion requirement. I have no idea if this algorithm would be faster than, say, Hensel division; it would be an interesting comparison; the problem comes with a common representation across the bignum library. The representation chosen in existing bignum libraries is for the reasons I've expanded on - it makes sense at the assembly level, where it was first done.
As a side note, you don't have to use uint32_t to represent your limbs. Ideally you use a size equal to the registers of the system (say uint64_t) so that you can take advantage of assembly-optimised versions. So on a 64-bit system adc rax, rbx only sets the carry (CF) if the result overspills 2^64.
tl;dr version: the problem isn't your algorithm or idea; it's the problem of converting between bases, since the representation you need for your algorithm isn't the most efficient way to do it in add/sub/mul etc. To paraphrase Knuth: this shows you the difference between mathematical elegance and computational efficiency.
If you need to frequently divide by the same divisor, using it (or a power of it) as your base makes division as cheap as bit-shifting is for base 2 binary integers.
You could use base 999 if you want; there's nothing special about using a power-of-10 base except that it makes conversion to decimal integer very cheap. (You can work one limb at a time instead of having to do a full division over the whole integer. It's like the difference between converting a binary integer to decimal vs. turning every 4 bits into a hex digit. Binary -> hex can start with the most significant bits, but converting to non-power-of-2 bases has to be LSB-first using division.)
For example, to compute the first 1000 decimal digits of Fibonacci(10^9) for a code-golf question with a performance requirement, my 105 bytes of x86 machine code answer used the same algorithm as this Python answer: the usual a+=b; b+=a Fibonacci iteration, but divide by (a power of) 10 every time a gets too large.
Fibonacci grows faster than carry propagates, so discarding the low decimal digits occasionally doesn't change the high digits long-term. (You keep a few extra beyond the precision you want).
Dividing by a power of 2 doesn't work, unless you keep track of how many powers of 2 you've discarded, because the eventual binary -> decimal conversion at the end would depend on that.
So for this algorithm, you have to do extended-precision addition, and division by 10 (or whatever power of 10 you want).
I stored base-10^9 limbs in 32-bit integer elements. Dividing by 10^9 is trivially cheap: just a pointer increment to skip the low limb. Instead of actually doing a memmove, I just offset the pointer used by the next add iteration.
I think division by a power of 10 other than 10^9 would be somewhat cheap, but would require an actual division on each limb, and propagating the remainder to the next limb.
Extended-precision addition is somewhat more expensive this way than with binary limbs, because I have to generate the carry-out manually with a compare: sum[i] = a[i] + b[i]; carry = sum < a; (unsigned comparison). And also manually wrap to 10^9 based on that compare, with a conditional-move instruction. But I was able to use that carry-out as an input to adc (x86 add-with-carry instruction).
You don't need a full modulo to handle the wrapping on addition, because you know you've wrapped at most once.
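A minimal sketch of that addition with base-10^9 limbs, least significant first (the names and layout are my assumptions, not the exact code from the answer; it assumes a has at least as many limbs as b):

#include <cstdint>
#include <vector>

constexpr uint32_t LIMB_BASE = 1000000000u;        // 10^9 per 32-bit limb

// a += b, both stored least-significant limb first.
void add_bignum(std::vector<uint32_t>& a, const std::vector<uint32_t>& b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        // limbs are < 10^9, so this sum cannot wrap a 32-bit unsigned int
        uint32_t s = a[i] + (i < b.size() ? b[i] : 0) + carry;
        carry = (s >= LIMB_BASE);                  // manual carry-out via compare
        a[i]  = carry ? s - LIMB_BASE : s;         // wrap back into [0, 10^9)
    }
    if (carry) a.push_back(1);
}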
This wastes just over 2 bits of each 32-bit limb: 10^9 instead of 2^32 = 4.29... * 10^9. Storing base-10 digits one per byte would be significantly less space efficient, and very much worse for performance, because an 8-bit binary addition costs the same as a 64-bit binary addition on a modern 64-bit CPU.
I was aiming for code-size: for pure performance I would have used 64-bit limbs holding base-10^19 "digits". (2^64 = 1.84... * 10^19, so this wastes less than 1 bit per 64.) This lets you get twice as much work done with each hardware add instruction. Hmm, actually this might be a problem: the sum of two limbs might wrap the 64-bit integer, so just checking for > 10^19 isn't sufficient anymore. You could work in base 5*10^18, or in base 10^18, or do more complicated carry-out detection that checks for binary carry as well as manual carry.
Storing packed BCD with one digit per 4 bit nibble would be even worse for performance, because there isn't hardware support for blocking carry from one nibble to the next within a byte.
Overall, my version ran about 10x faster than the Python extended-precision version on the same hardware (but it had room for significant optimization for speed, by dividing less often). (70 seconds or 80 seconds vs. 12 minutes)
Still, I think for this particular implementation of that algorithm (where I only needed addition and division, and division happened after every few additions), the choice of base-10^9 limbs was very good. There are much more efficient algorithms for the Nth Fibonacci number that don't need to do 1 billion extended-precision additions.