I know that it is possible to use the left shift to implement multiplication by the power of two (x << 4 = x * 16).
Also, it is trivial to replace the right shift with division by a power of two (x >> 5 = x / 32).
I am wondering whether it is possible to replace the right shift with multiplication?
It seems not to be possible in the general case, but my question is limited to modulo 2^32 and 2^64 arithmetic (unsigned 32-bit and 64-bit values). Also, maybe it can be done if we can add other cheap instructions like + and - in addition to * to emulate the right bit shift?
I am assuming an exotic architecture where the right shift is more expensive than other arithmetic (similar to division).
uint64_t foo(uint64_t x) {
    return x >> 3; // how to avoid using right shift here?
}
There is a similar question How to perform right shifting binary multiplication? that asks how to replace multiplication of two unsigned numbers by right shift. Basically, it uses a loop internally. However, maybe if the second number is a constant, this loop can be avoided (or at least unrolled to a shorter fragment)?
"Multiply-high" aka high-mul, hmul, mulh, etc, can be used to emulate a shift-right with a constant count. Usually that's not a good trade. It's also hardly related to C++.
Normal multiplication (putting floating point stuff aside) cannot be used to implement a shift-right.
my question is limited to modulo 2^32 and 2^64 arithmetic
It doesn't help. You can use that property to "unmultiply" (sort of like divide, except not really) by odd numbers; for example if b = 5 * a then a = b * 0xCCCCCCCD, using the modular multiplicative inverse. The number being inverted must be relatively prime to the modulus. Since the modulus is a power of two, the "divisor" here cannot be a power of two (except 1, but that does nothing), so a shift-right cannot be done this way.
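To make the "unmultiply" concrete, a small self-check (values are illustrative; 5 * 0xCCCCCCCD is congruent to 1 mod 2^32):

#include <cassert>
#include <cstdint>

// Demonstration: multiplying by the modular inverse 0xCCCCCCCD undoes a
// multiplication by 5, mod 2^32.
int main() {
    uint32_t a = 123456789u;
    uint32_t b = 5u * a;              // may wrap mod 2^32; that's fine
    uint32_t recovered = 0xCCCCCCCDu * b;
    assert(recovered == a);
}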
Another way to look at it (probably simpler) is that what a multiplication does is conditionally add together a bunch of left-shifted versions of the multiplicand. Only left-shifted versions, not right-shifted versions. Which of those shifted versions the multiplier selects doesn't matter; there are no right-shifted versions to select.
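Written out as a toy shift-and-add multiply (mod 2^64), purely to illustrate that point:

#include <cstdint>

// A multiply mod 2^64 only ever sums left-shifted copies of the
// multiplicand, selected by the multiplier's bits.
uint64_t shift_add_mul(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i)
        if (b & (1ULL << i))   // bit i of the multiplier selects...
            r += a << i;       // ...the copy shifted left by i places
    return r;
}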
I have the following problem: I have 3 unsigned integers a, b, c. c is the maximum value that the unsigned integer type I am using can take (so a <= c && b <= c). I know that a * b may overflow, and I know that the number (a * b / c) does not overflow (basically, I need the number one would get by casting a, b, c to an unsigned integer type with enough bits and performing the multiplication and division). What is the fastest way to find the number (a * b / c) without having to cast to an unsigned integer type with more bits (preferably with the lowest error possible)?
I am currently using float casting, and I was wondering whether there is a method that produces results that are better and faster. I know that a*b/c can be expressed as either a/(c/b) or b/(c/a), with the option of expanding the remainder too, but depending on how many times I expand it the error can be big, and I am not exactly sure how that compares, in terms of speed, to casting to float and performing the division in float. I have looked at Avoiding overflow in integer multiplication followed by division, though I was hoping that with the given additional information about a, b, c a better/faster method could be used. The linked post also doesn't mention anything about speed.
Edit: It's also preferable for the method to work fast on the GPU too, but not necessary.
Edit2: If anybody's wondering why I would need this - I am using uint to represent numbers in [0,1], since I need arithmetic operations to not accumulate error.
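For reference, a minimal version of the float-casting baseline described above might look like this sketch (the helper name is mine; float's 24-bit mantissa is what limits the precision):

#include <cstdint>

// Baseline sketch: a * b / c via float, as the question describes.
// Precision is capped by float's 24-bit mantissa, hence the error.
uint32_t mul_div_float(uint32_t a, uint32_t b, uint32_t c) {
    return (uint32_t)((float)a * (float)b / (float)c);
}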
Disclaimer: I know, unsigned integers are primitive numeric types but they don't technically overflow; I'm using the term "overflow" for all the primitive numeric types in general here.
In C or C++, according to the standard or to a particular implementation, are there primitive numeric types where, given an arithmetic operation that could overflow, I can save the result plus the part that overflows?
Even if this sounds strange, my idea is that the registers on modern CPUs are usually much larger than a 32-bit float or a 64-bit uint64_t, so there is the potential to actually control the overflow and store it somewhere.
No, the registers are not "usually much larger than a 64 bit uint64_t".
There's an overflow flag, and for a limited number of operations (addition and subtraction), pairing this single additional bit with the result is enough to capture the entire range of outcomes.
But in general, you'd need to cast to a larger type (potentially implemented in software) to handle results that overflow the type of your input.
Any operations that do this sort of thing (for example some 32-bit processors had a 32x32 => 32 high, 32 low wide multiply instruction) will be provided by your compiler as intrinsic functions, or via inline assembly.
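For the 32x32 => 64 case, portable C++ needs no intrinsic at all; a sketch (the helper name is mine):

#include <cstdint>

// Capture the full 64-bit product of two 32-bit operands: the portable
// equivalent of a 32x32 => 32 high, 32 low wide-multiply instruction.
void wide_mul32(uint32_t a, uint32_t b, uint32_t* hi, uint32_t* lo) {
    uint64_t p = (uint64_t)a * b;   // widen before multiplying
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}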
look, I found a 64-bit version of that, named Multiply128, and the matching __mul128 intrinsic available in Visual C++
See @Ben Voigt about larger registers.
There really may be only an overflow bit that could help you.
Another approach, without resorting to wider integers, is to test overflow yourself:
unsigned a, b, sum;
sum = a + b;
if (sum < a) {
    OverflowDetected(); // mathematical result is sum + UINT_MAX + 1
}
A similar approach works for int. The following could likely be simplified - I just don't have a tidier version at hand.
[Edit]
My approach below has potential UB. For a better way to detect int overflow, see Simpler method to detect int overflow.
int a, b, sum;
sum = a + b;
// out-of-range only possible when the signs are the same.
if ((a < 0) == (b < 0)) {
    if (a < 0) {
        if (sum > b) UnderflowDetected();
    }
    else {
        if (sum < b) OverflowDetected();
    }
}
For floating-point types, you can actually 'control' the overflow on the x86 platform. With the functions _control87, _controlfp, and __control87_2, you can get and set the floating-point control word. By default, the run-time libraries mask all floating-point exceptions; you can unmask them in your code, so when an overflow occurs you'll get an exception. However, the code we write today all assumes that floating-point exceptions are masked, so if you unmask them you'll encounter some problems.
You can use these functions to get the status word.
For floating point types the results are well defined by the hardware, and you're not going to be able to get much control without working around C/C++.
For integer types smaller than int, they will be promoted to int by any arithmetic operation. You will probably be able to detect overflow before you coerce the result back into the smaller type.
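As a sketch of that (the helper name is mine, and it assumes int is wider than 16 bits):

#include <cstdint>
#include <limits>

// Thanks to integer promotion, a uint16_t addition is performed in int,
// so the out-of-range result can be caught before narrowing back.
bool add_u16_checked(uint16_t a, uint16_t b, uint16_t* out) {
    int wide = a + b;   // promoted to int; cannot overflow here
    if (wide > std::numeric_limits<uint16_t>::max()) return false;
    *out = (uint16_t)wide;
    return true;
}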
For addition and subtraction you can detect overflow by comparing the result to the inputs. Adding two positive integers will always yield a result larger than either of the inputs, unless there was overflow.
Which of the following techniques is the best option for dividing an integer by 2 and why?
Technique 1:
x = x >> 1;
Technique 2:
x = x / 2;
Here x is an integer.
Use the operation that best describes what you are trying to do.
If you are treating the number as a sequence of bits, use bitshift.
If you are treating it as a numerical value, use division.
Note that they are not exactly equivalent. They can give different results for negative integers. For example:
-5 / 2 = -2
-5 >> 1 = -3
Does the first one look like dividing? No. If you want to divide, use x / 2. The compiler can optimise it to use a bit-shift if possible (it's called strength reduction), which makes it a useless micro-optimisation to do on your own.
To pile on: there are so many reasons to favor using x = x / 2; Here are some:
it expresses your intent more clearly (assuming you're not dealing with bit twiddling register bits or something)
the compiler will reduce this to a shift operation anyway
even if the compiler didn't reduce it and chose a slower operation than the shift, the likelihood that this ends up affecting your program's performance in a measurable way is itself vanishingly small (and if it does affect it measurably, then you have an actual reason to use a shift)
if the division is going to be part of a larger expression, you're more likely to get the precedence right if you use the division operator:
x = x / 2 + 5;
x = x >> 1 + 5; // not the same as above
signed arithmetic might complicate things even more than the precedence problem mentioned above
to reiterate - the compiler will already do this for you anyway. In fact, it'll convert division by a constant to a series of shifts, adds, and multiplies for all sorts of numbers, not just powers of two; a sketch of that lowering follows below. See this question for links to even more information about this.
In short, you buy nothing by coding a shift when you really mean to multiply or divide, except maybe an increased possibility of introducing a bug. It's been a lifetime since compilers weren't smart enough to optimize this kind of thing to a shift when appropriate.
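As an illustration of that division-by-constant lowering, here is roughly what compilers emit for unsigned division by 10 - a widening multiply by a "magic" constant followed by a shift:

#include <cstdint>

// Strength reduction of x / 10 (unsigned): multiply by
// ceil(2^35 / 10) = 0xCCCCCCCD and keep the top bits.
// Exact for every 32-bit x.
uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}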
Which one is the best option and why for dividing the integer number by 2?
Depends on what you mean by best.
If you want your colleagues to hate you, or to make your code hard to read, I'd definitely go with the first option.
If you want to divide a number by 2, go with the second one.
The two are not equivalent; they don't behave the same for negative numbers or inside larger expressions: bitshift has lower precedence than + or -, while division has higher precedence.
You should write your code to express what its intent is. If performance is your concern, don't worry, the optimizer does a good job at these sort of micro-optimizations.
Just use divide (/), presuming it is clearer. The compiler will optimize accordingly.
I agree with other answers that you should favor x / 2 because its intent is clearer, and the compiler should optimize it for you.
However, another reason for preferring x / 2 over x >> 1 is that the behavior of >> is implementation-dependent if x is a signed int and is negative.
From section 6.5.7, bullet 5 of the ISO C99 standard:
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
x / 2 is clearer, and x >> 1 is not much faster (according to a micro-benchmark, about 30% faster for a Java JVM). As others have noted, for negative numbers the rounding is slightly different, so you have to consider this when you want to process negative numbers. Some compilers may automatically convert x / 2 to x >> 1 if they know the number cannot be negative (even though I could not verify this).
Even x / 2 may not use the (slow) division CPU instruction, because some shortcuts are possible, but it is still slower than x >> 1.
(This is a C / C++ question; other programming languages have more operators. For Java there is also the unsigned right shift, x >>> 1, which is again different. It allows one to correctly calculate the mean (average) of two values, so that (a + b) >>> 1 returns the mean even for very large values of a and b. This is required, for example, for binary search if the array indices can get very large. There was a bug in many versions of binary search, because they used (a + b) / 2 to calculate the average, which doesn't work correctly. The correct solution is to use (a + b) >>> 1 instead.)
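In C or C++, which have no >>> operator, the usual overflow-safe fix is to rearrange the computation instead; a sketch:

// Overflow-safe midpoint for binary search, for 0 <= low <= high:
// (low + high) / 2 can overflow, low + (high - low) / 2 cannot.
int midpoint(int low, int high) {
    return low + (high - low) / 2;
}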
Knuth said:
Premature optimization is the root of all evil.
So I suggest using x /= 2;
This way the code is easy to understand, and I don't think optimizing this operation by hand makes any big difference for the processor.
Take a look at the compiler output to help you decide. I ran this test on x86-64 with
gcc (GCC) 4.2.1 20070719 [FreeBSD]
Also see compiler outputs online at godbolt.
What you see is that the compiler uses a sarl (arithmetic right-shift) instruction in both cases, so it does recognize the similarity between the two expressions. If you use the divide, the compiler also needs to adjust for negative numbers. To do that, it shifts the sign bit down to the lowest-order bit and adds it to the result. This fixes the off-by-one issue when shifting negative numbers, compared to what a divide would do.
Since the divide case does 2 shifts, while the explicit shift case only does one, we can now explain some of the performance differences measured by other answers here.
C code with assembly output:
For divide, your input would be
int div2signed(int a) {
return a / 2;
}
and this compiles to
movl %edi, %eax
shrl $31, %eax # (unsigned)x >> 31
addl %edi, %eax # tmp = x + (x<0)
sarl %eax # (x + 0 or 1) >> 1 arithmetic right shift
ret
similarly for shift
int shr2signed(int a) {
return a >> 1;
}
with output:
sarl %edi
movl %edi, %eax
ret
Other ISAs can do this about as efficiently, if not more so. For example, GCC for AArch64 uses:
add w0, w0, w0, lsr 31 // x += (unsigned)x>>31
asr w0, w0, 1 // x >>= 1
ret
Just an added note -
x *= 0.5 will often be faster in some VM-based languages - notably ActionScript - as the variable won't have to be checked for division by 0.
Use x = x / 2; or x /= 2;, because it is possible that a new programmer will work on it in the future. It will be easier for them to find out what is going on in that line of code. Not everyone may be aware of such optimizations.
I am speaking for the purpose of programming competitions. Generally they have very large inputs, where division by 2 takes place many times, and it is known whether the input is positive or negative.
x >> 1 will be better than x / 2. I checked on ideone.com by running a program where more than 10^10 division-by-2 operations took place. x / 2 took nearly 5.5 s, whereas x >> 1 took nearly 2.6 s for the same program.
I would say there are several things to consider.

Bitshift should be faster, as no special computation is really needed to shift the bits; however, as pointed out, there are potential issues with negative numbers. If you are ensured to have positive numbers and are looking for speed, then I would recommend bitshift.

The division operator is very easy for humans to read. So if you are looking for code readability, you could use this. Note that the field of compiler optimization has come a long way, so making code easy to read and understand is good practice.

Depending on the underlying hardware, operations may have different speeds. Amdahl's law is to make the common case fast. So you may have hardware that can perform different operations faster than others. For example, multiplying by 0.5 may be faster than dividing by 2. (Granted, you may need to take the floor of the multiplication if you wish to enforce integer division.)

If you are after pure performance, I would recommend creating some tests that could do the operations millions of times. Sample the execution several times (your sample size) to determine which one is statistically best with your OS/hardware/compiler/code; a minimal harness is sketched below.
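A minimal harness along those lines might look like the following sketch (iteration count and seed are arbitrary; volatile merely keeps the loops from being optimized away, and a real benchmark needs much more care than this):

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    volatile uint32_t x = 123456789u;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) x = x / 2;   // time the division
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) x = x >> 1;  // time the shift
    auto t2 = std::chrono::steady_clock::now();
    std::printf("x / 2 : %lld ms\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    std::printf("x >> 1: %lld ms\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
}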
As far as the CPU is concerned, bit-shift operations are faster than division operations. However, the compiler knows this and will optimize appropriately to the extent that it can, so you can code in the way that makes the most sense and rest easy knowing that your code is running efficiently. But remember that an unsigned int can (in some cases) be optimized better than an int, for reasons previously pointed out. If you don't need signed arithmetic, then don't include the sign bit.
x = x / 2; is the suitable code to use, but which operation you choose depends on your own program and the output you want to produce.
Make your intentions clearer... for example, if you want to divide, use x / 2 and let the compiler optimize it to a shift operator (or anything else).
Today's processors won't let these optimizations have any impact on the performance of your programs.
The answer to this will depend on the environment you're working under.
If you're working on an 8-bit microcontroller or anything without hardware support for multiplication, bit shifting is expected and commonplace, and while the compiler will almost certainly turn x /= 2 into x >>= 1, the presence of a division symbol will raise more eyebrows in that environment than using a shift to effect a division.
If you're working in a performance-critical environment or section of code, or your code could be compiled with compiler optimization off, x >>= 1 with a comment explaining its reasoning is probably best just for clarity of purpose.
If you're not under one of the above conditions, make your code more readable by simply using x /= 2. Better to save the next programmer who happens to look at your code the 10 second double-take on your shift operation than to needlessly prove you knew the shift was more efficient sans compiler optimization.
All these assume unsigned integers. The simple shift is probably not what you want for signed. Also, DanielH brings up a good point about using x *= 0.5 for certain languages like ActionScript.
Take the value mod 2 and test for a result of 1 (I don't know the syntax in C), but this may be fastest.
Generally, the right shift divides:
q = i >> n; is the same as: q = i / 2**n;
This is sometimes used to speed up programs at the cost of clarity. I don't think you should do it. The compiler is smart enough to perform the speedup automatically. This means that putting in a shift gains you nothing at the expense of clarity.
Take a look at this page from Practical C++ Programming.
Obviously, if you are writing your code for the next guy who reads it, go for the clarity of "x/2".
However, if speed is your goal, try it both ways and time the results. A few months ago I worked on a bitmap convolution routine which involved stepping through an array of integers and dividing each element by 2. I did all kinds of things to optimize it including the old trick of substituting "x>>1" for "x/2".
When I actually timed both ways, I discovered to my surprise that x/2 was faster than x>>1!
This was using Microsoft VS2008 C++ with the default optimizations turned on.
In terms of performance, CPUs' shift operations are significantly faster than divide opcodes. So dividing by two or multiplying by 2 etc. all benefit from shift operations.
As to the look and feel: as engineers, when did we become so attached to cosmetics that even beautiful ladies don't use! :)
x / 2 is the correct one, and >> is the shifting operator. If we want to divide an integer, we can use the division operator (/); the shift operator is used to shift the bits.
x = x / 2;
x /= 2;
We can use it like this.
I was thinking about an algorithm for division of large numbers: dividing, with remainder, a bigint C by a bigint D, where we know the representation of C in base b, and D is of the form b^k - 1. It's probably easiest to show it with an example. Let's try dividing C = 21979182173 by D = 999.
We write the number as sets of three digits: 21 979 182 173
We take sums (modulo 999) of consecutive sets, starting from the left: 21 001 183 356
We add 1 to those sets preceding the ones where we "went over 999": 22 001 183 356
Indeed, 21979182173/999=22001183 and remainder 356.
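A minimal sketch of this procedure in C++, hard-coding b = 10 and k = 3 (so D = 999), with the number stored as base-1000 digits, most significant first (function and variable names are mine):

#include <cassert>
#include <cstdint>
#include <vector>

// Running-sum division by 999. Returns the quotient digits in base 1000
// (possibly with a leading zero); the remainder comes back via *rem.
std::vector<uint32_t> divmod999(const std::vector<uint32_t>& c, uint32_t* rem) {
    std::vector<uint32_t> q;
    uint32_t run = 0;                      // running sum, kept below 999
    for (size_t i = 0; i < c.size(); ++i) {
        q.push_back(run);                  // previous sum becomes a quotient digit
        run += c[i];
        if (run >= 999) {                  // "went over 999": reduce, carry left
            run -= 999;
            for (size_t j = q.size(); j-- > 0; ) {
                if (++q[j] < 1000) break;  // carry absorbed
                q[j] = 0;                  // ripple further left
            }
        }
    }
    *rem = run;                            // the final running sum is the remainder
    return q;
}

int main() {
    uint32_t rem;
    std::vector<uint32_t> q = divmod999({21, 979, 182, 173}, &rem);
    // 21979182173 = 999 * 22001183 + 356
    assert((q == std::vector<uint32_t>{0, 22, 1, 183}) && rem == 356);
}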
I've calculated the complexity and, if I'm not mistaken, the algorithm should work in O(n), n being the number of digits of C in base b representation. I've also done a very crude and unoptimized version of the algorithm (only for b=10) in C++, tested it against GMP's general integer division algorithm and it really does seem to fare better than GMP. I couldn't find anything like this implemented anywhere I looked, so I had to resort to testing it against general division.
I found several articles which discuss what seem to be quite similar matters, but none of them concentrate on actual implementations, especially in bases different than 2. I suppose that's because of the way numbers are internally stored, although the mentioned algorithm seems useful for, say, b=10, even taking that into account. I also tried contacting some other people, but, again, to no avail.
Thus, my question would be: is there an article or a book or something where the aforementioned algorithm is described, possibly discussing the implementations? If not, would it make sense for me to try and implement and test such an algorithm in, say, C/C++ or is this algorithm somehow inherently bad?
Also, I'm not a programmer and while I'm reasonably OK at programming, I admittedly don't have much knowledge of computer "internals". Thus, pardon my ignorance - it's highly possible there are one or more very stupid things in this post. Sorry once again.
Thanks a lot!
Further clarification of points raised in the comments/answers:
Thanks, everyone - as I didn't want to comment on all the great answers and advice with the same thing, I'd just like to address one point a lot of you touched on.
I am fully aware that working in bases 2^n is, generally speaking, clearly the most efficient way of doing things. Pretty much all bigint libraries use 2^32 or whatever. However, what if (and, I emphasize, it would be useful only for this particular algorithm!) we implement bigints as an array of digits in base b? Of course, we require b here to be "reasonable": b = 10, the most natural case, seems reasonable enough. I know it's more or less inefficient both in memory and in time, taking into account how numbers are internally stored, but I have been able, if my (basic and possibly somehow flawed) tests are correct, to produce results faster than GMP's general division, which would justify implementing such an algorithm.
Ninefingers notes that I'd have to use an expensive modulo operation in that case. I hope not: I can see whether old+new crossed, say, 999 just by looking at the number of digits of old+new+1. If it has 4 digits, we're done. Even more, since old < 999 and new <= 999, we know that if old+new+1 has 4 digits (it can't have more), then (old+new) % 999 equals deleting the leftmost digit of old+new+1, which I presume we can do cheaply.
Of course, I'm not disputing the obvious limitations of this algorithm, nor do I claim it can't be improved - it can only divide by a certain class of numbers, and we have to know the representation of the dividend in base b a priori. However, for b = 10, for instance, the latter seems natural.
Now, say we have implemented bignums as I outlined above. Say C=(a_1a_2...a_n) in base b and D=b^k-1. The algorithm (which could be probably much more optimized) would go like this. I hope there aren't many typos.
if k > n, we're obviously done
add a zero (i.e. a_0 = 0) at the beginning of C (just in case we try to divide, say, 9999 by 99)
l = n % k (mod for "regular" integers - shouldn't be too expensive)
old = (a_0...a_l) (the first set of digits, possibly with fewer than k digits)
for (i = l+1; i < n; i = i+k) (we will have floor(n/k) or so iterations)
    new = (a_i...a_(i+k-1))
    new = new + old (this is bigint addition, thus O(k))
    aux = new + 1 (again, bigint addition - O(k) - which I'm not happy about)
    if aux has more than k digits
        delete the first digit of aux
        old = old + 1 (bigint addition once again)
        fill old with zeroes at the beginning so it has as many digits as it should
        (a_(i-k)...a_(i-1)) = old (if i = l+1, (a_0...a_l) = old)
        new = aux
    fill new with zeroes at the beginning so it has as many digits as it should
    (a_i...a_(i+k-1)) = new
    old = new (carry the running sum into the next iteration)
quot = (a_0...a_(n-k+1))
rem = new
There, thanks for discussing this with me - as I said, this does seem to me to be an interesting "special case" algorithm to try to implement, test and discuss, if nobody sees any fatal flaws in it. If it's something not widely discussed so far, even better. Please, let me know what you think. Sorry about the long post.
Also, just a few more personal comments:
@Ninefingers: I actually have some (very basic!) knowledge of how GMP works, what it does, and of general bigint division algorithms, so I was able to understand much of your argument. I'm also aware GMP is highly optimized and in a way customizes itself for different platforms, so I'm certainly not trying to "beat it" in general - that seems about as fruitful as attacking a tank with a pointed stick. However, that's not the idea of this algorithm - it works in very special cases (which GMP does not appear to cover). On an unrelated note, are you sure general divisions are done in O(n)? The most I've seen done is M(n). (And that, if I understand correctly, cannot reach O(n) in practice (Schönhage-Strassen etc.). Fürer's algorithm, which still doesn't reach O(n), is, if I'm correct, almost purely theoretical.)
@Avi Berger: This doesn't actually seem to be exactly the same as "casting out nines", although the idea is similar. However, the aforementioned algorithm should work all the time, if I'm not mistaken.
Your algorithm is a variation of a base 10 algorithm known as "casting out nines". Your example is using base 1000 and "casting out" 999's (one less than the base). This used to be taught in elementary school as a way to do a quick check on hand calculations. I had a high school math teacher who was horrified to learn that it wasn't being taught anymore and filled us in on it.
Casting out 999's in base 1000 won't work as a general division algorithm. It will generate values that are congruent modulo 999 to the actual quotient and remainder - not the actual values. Your algorithm is a bit different and I haven't checked if it works, but it is based on effectively using base 1000 and the divisor being 1 less than the base. If you wanted to try it for dividing by 47, you would have to convert to a base 48 number system first.
Google "casting out nines" for more information.
Edit: I originally read your post a bit too quickly; you do know of this as a working algorithm. As @Ninefingers and @Karl Bielefeldt have stated more clearly than I did in their comments, what you aren't including in your performance estimate is the conversion into a base appropriate for the particular divisor at hand.
I feel the need to add to this based on my comment. This isn't an answer, but an explanation as to the background.
A bignum library uses what are called limbs - search for mp_limb_t in the gmp source, which are usually a fixed-size integer field.
When you do something like addition, one way (albeit inefficient) to approach it is to do this:
doublelimb r = limb_a + limb_b + carryfrompreviousiteration
This double-sized limb catches the overflow of limb_a + limb_b in the case that the sum is bigger than the limb size. So if the total is bigger than 2^32 (if we're using uint32_t as our limb size), the overflow can be caught.
Why do we need this? Well, what you typically do is loop through all the limbs - you've done this yourself in dividing your integer up and going through each one - but we do it least-significant limb first, just as you'd do arithmetic by hand.
This might seem inefficient, but this is just the C way of doing things. To really break out the big guns, x86 has adc as an instruction - add with carry. What this does is an arithmetic add on your fields, and it sets the carry bit if the arithmetic overflows the size of the register. The next time you do an add or adc, the processor factors in the carry bit too. In subtraction it's called the borrow flag.
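A sketch of that carry chain in portable C++-flavoured C, using a double-sized temporary to recover what adc gives you in hardware:

#include <cstddef>
#include <cstdint>

// dst = a + b over n limbs, least-significant limb first; returns the
// final carry-out. The 64-bit temporary plays the role of "doublelimb".
uint32_t add_limbs(uint32_t* dst, const uint32_t* a,
                   const uint32_t* b, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        dst[i] = (uint32_t)t;          // low half is the result limb
        carry = (uint32_t)(t >> 32);   // high half is the carry-out
    }
    return carry;
}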
This also applies to shift operations. As such, this feature of the processor is crucial to what makes bignums fast. So the fact is, there's electronic circuitry in the chip for doing this stuff - doing it in software is always going to be slower.
Without going into too much detail, operations are built up from this ability to add, shift, subtract etc. They're crucial. Oh and you use the full width of your processor's register per limb if you're doing it right.
Second point - conversion between bases. You cannot take a value in the middle of a number and change its base, because you can't account for the overflow from the digit beneath it in your original base, and that number can't account for the overflow from the digit beneath it... and so on. In short, every time you want to change base, you need to convert the entire bignum from the original base to your new base and back again. So you have to walk the bignum (all the limbs) three times at least. Or, alternatively, detect overflows expensively in all other operations... remember, now you need to do modulo operations to work out if you overflowed, whereas before the processor was doing it for us.
I should also like to add that whilst what you've got is probably quick for this case, bear in mind that as a bignum library gmp does a fair bit of work for you, like memory management. If you're using mpz_ you're using an abstraction above what I've described here, for starters. Finally, gmp uses hand optimised assembly with unrolled loops for just about every platform you've ever heard of, plus more. There's a very good reason it ships with Mathematica, Maple et al.
Now, just for reference, some reading material.
Modern Computer Arithmetic is a Knuth-like work for arbitrary precision libraries.
Donald Knuth, Seminumerical Algorithms (The Art of Computer Programming Volume II).
William Hart's blog on implementing algorithms for bsdnt, in which he discusses various division algorithms. If you're interested in bignum libraries, this is an excellent resource. I considered myself a good programmer until I started following this sort of stuff...
To sum it up for you: division assembly instructions suck, so people generally compute inverses and multiply instead, as you do when defining division in modular arithmetic. The various techniques that exist (see MCA) are mostly O(n).
Edit: OK, not all of the techniques are O(n). Most of the techniques called div1 (dividing by something not bigger than a limb) are O(n). When you go bigger you end up with O(n^2) complexity; this is hard to avoid.
Now, could you implement bigints as an array of digits? Well yes, of course you could. However, consider what just the addition would look like under that scheme:
/* you wouldn't do this just before an add; it's just to
   show you the declaration.
*/
uint32_t* x = malloc(num_limbs*sizeof(uint32_t));
uint32_t* y = malloc(num_limbs*sizeof(uint32_t));
uint32_t* a = malloc(num_limbs*sizeof(uint32_t));
uint32_t m = 0; /* carry from the previous limb */
size_t i;
for ( i = 0; i < num_limbs; i++ )
{
    uint64_t t = (uint64_t)x[i] + y[i] + m;
    /* now we need to work out if that overflowed at all */
    m = (uint32_t)(t / somebase);    /* expensive division to extract the carry */
    a[i] = (uint32_t)(t % somebase); /* and another to keep the digit in range */
}
/* frees somewhere */
That's a rough sketch of what you're looking at for addition via your scheme. So you have to run the conversion between bases. So you're going to need a conversion to your representation for the base, then back when you're done, because this form is just really slow everywhere else. We're not talking about the difference between O(n) and O(n^2) here, but we are talking about an expensive division instruction per limb or an expensive conversion every time you want to divide. See this.
Next up, how do you expand your division for general case division? By that, I mean when you want to divide those two numbers x and y from the above code. You can't, is the answer, without resorting to bignum-based facilities, which are expensive. See Knuth. Taking modulo a number greater than your size doesn't work.
Let me explain. Try 21979182173 mod 1099. Let's assume here for simplicity's sake that the biggest size field we can have is three digits. This is a contrived example, but the biggest field size I know of is 128 bits, using gcc extensions. Anyway, the point is, you:
21 979 182 173
Split your number into limbs. Then you take modulo and sum:
21 1000 1182 1355
It doesn't work. This is where Avi is correct, because this is a form of casting out nines, or an adaptation thereof, but it doesn't work here because our fields have overflowed for a start - you're using the modulo to ensure each field stays within its limb/field size.
So what's the solution? Split your number up into a series of appropriately sized bignums? And start using bignum functions to calculate everything you need to? This is going to be much slower than any existing way of manipulating the fields directly.
Now perhaps you're only proposing this case for dividing by a limb, not a bignum, in which case it can work, but Hensel division and precomputed inverses etc. do so without the conversion requirement. I have no idea if this algorithm would be faster than, say, Hensel division; it would be an interesting comparison; the problem comes with a common representation across the bignum library. The representation chosen in existing bignum libraries is the way it is for the reasons I've expanded on - it makes sense at the assembly level, where it was first done.
As a side note: you don't have to use uint32_t to represent your limbs. Ideally you use a size the size of the system's registers (say uint64_t) so that you can take advantage of assembly-optimised versions. So on a 64-bit system adc rax, rbx only sets the carry flag (CF) if the result overflows 2^64.
tl;dr version: the problem isn't your algorithm or idea; it's the problem of converting between bases, since the representation you need for your algorithm isn't the most efficient way to do it in add/sub/mul etc. To paraphrase Knuth: this shows you the difference between mathematical elegance and computational efficiency.
If you need to frequently divide by the same divisor, using it (or a power of it) as your base makes division as cheap as bit-shifting is for base 2 binary integers.
You could use base 999 if you want; there's nothing special about using a power-of-10 base except that it makes conversion to decimal integer very cheap. (You can work one limb at a time instead of having to do a full division over the whole integer. It's like the difference between converting a binary integer to decimal vs. turning every 4 bits into a hex digit. Binary -> hex can start with the most significant bits, but converting to non-power-of-2 bases has to be LSB-first using division.)
For example, to compute the first 1000 decimal digits of Fibonacci(10^9) for a code-golf question with a performance requirement, my 105 bytes of x86 machine code answer used the same algorithm as this Python answer: the usual a+=b; b+=a Fibonacci iteration, but divide by (a power of) 10 every time a gets too large.
Fibonacci grows faster than carry propagates, so discarding the low decimal digits occasionally doesn't change the high digits long-term. (You keep a few extra beyond the precision you want).
Dividing by a power of 2 doesn't work, unless you keep track of how many powers of 2 you've discarded, because the eventual binary -> decimal conversion at the end would depend on that.
So for this algorithm, you have to do extended-precision addition, and division by 10 (or whatever power of 10 you want).
I stored base-10^9 limbs in 32-bit integer elements. Dividing by 10^9 is trivially cheap: just a pointer increment to skip the low limb. Instead of actually doing a memmove, I just offset the pointer used by the next add iteration.
I think division by a power of 10 other than 10^9 would be somewhat cheap, but would require an actual division on each limb, and propagating the remainder to the next limb.
Extended-precision addition is somewhat more expensive this way than with binary limbs, because I have to generate the carry-out manually with a compare: sum[i] = a[i] + b[i]; carry = sum[i] < a[i]; (unsigned comparison). And also manually wrap to 10^9 based on that compare, with a conditional-move instruction. But I was able to use that carry-out as an input to adc (x86 add-with-carry instruction).
You don't need a full modulo to handle the wrapping on addition, because you know you've wrapped at most once.
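Putting the last two paragraphs together, one limb-addition step might look like this sketch (assuming base-10^9 limbs in uint32_t, least-significant first; since each limb is below 10^9, the 32-bit sum cannot wrap and one conditional subtract suffices):

#include <cstddef>
#include <cstdint>

const uint32_t BASE = 1000000000u;   // 10^9 fits easily in 32 bits

// sum = a + b over n base-10^9 limbs; returns the final carry-out.
uint32_t add_limbs_1e9(uint32_t* sum, const uint32_t* a,
                       const uint32_t* b, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t s = a[i] + b[i] + carry;  // < 2 * 10^9, no binary wrap
        carry = (s >= BASE);               // manual carry-out via compare
        if (carry) s -= BASE;              // wraps at most once, no modulo
        sum[i] = s;
    }
    return carry;
}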
This wastes just over 2 bits of each 32-bit limb: 10^9 instead of 2^32 = 4.29... * 10^9. Storing base-10 digits one per byte would be significantly less space-efficient, and very much worse for performance, because an 8-bit binary addition costs the same as a 64-bit binary addition on a modern 64-bit CPU.
I was aiming for code-size: for pure performance I would have used 64-bit limbs holding base-10^19 "digits". (2^64 = 1.84... * 10^19, so this wastes less than 1 bit per 64.) This lets you get twice as much work done with each hardware add instruction. Hmm, actually this might be a problem: the sum of two limbs might wrap the 64-bit integer, so just checking for > 10^19 isn't sufficient anymore. You could work in base 5*10^18, or in base 10^18, or do more complicated carry-out detection that checks for binary carry as well as manual carry.
Storing packed BCD with one digit per 4 bit nibble would be even worse for performance, because there isn't hardware support for blocking carry from one nibble to the next within a byte.
Overall, my version ran about 10x faster than the Python extended-precision version on the same hardware (but it had room for significant optimization for speed, by dividing less often). (70 seconds or 80 seconds vs. 12 minutes)
Still, I think for this particular implementation of that algorithm (where I only needed addition and division, and division happened after every few additions), the choice of base-10^9 limbs was very good. There are much more efficient algorithms for the Nth Fibonacci number that don't need to do 1 billion extended-precision additions.