I'm working through a codebase that uses the following encoding to indicate sampling with replacement: we maintain an array of ints that records both position and presence in the sample, where a positive int indicates a position in another array and a negative int indicates that we should not use that data point in this iteration.
Example:
data_points: [...] // vector of Objects, length 5 == position.size()
std::vector<int> position: [3, 4, -3, 1, -2]
would mean that the first element in data_points should go to bucket 3, the second to bucket 4, and the fourth to bucket 1.
The negative values indicate that we won't be using those data points in this iteration: the 3rd and 5th data points are marked as excluded because their entries in position are negative, having been set with position[i] = ~position[i].
The trick is that we might perform this multiple times, but the positions of the data points in the index should not change. So in the next iteration, if we want to exclude data point 1 and include data point 5, we could do something like:
position[0] = ~position[0] // Bit-wise complement flips the sign on ints and subtracts 1
position[4] = ~position[4]
This will change the position vector into
std::vector<int> position: [-4, 4, -3, 1, 1]
Now on to the question: At the end of each round I want to reset all signs to positive, i.e. position should become [3, 4, 3, 1, 2].
Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?
Also, because I'm new to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
Edit: The above is wrong; the complement of a (an int) will give -(a + 1), and this depends on the representation of ints, as noted in the answers. So the original question of simply taking the positive value of an existing one does not apply; we actually need to perform another bitwise complement to get the original value back.
Example:
position[0] = 1
position[0] = ~position[0] // Now position[0] is -2
// So if we did
position[0] = std::abs(position[0]) // This is wrong, position[0] is now 2!
// If we want to reset it to 1, we need to do another complement
position[0] = ~position[0] // Now position[0] is again 1
I recommend not attempting bit fiddling. Partly because you're dealing with signed numbers, and you would throw away portability if you fiddle. Partly because bit fiddling is not as readable as reusable functions.
A simple solution:
// Note: std::for_each ignores the lambda's return value, so take the
// element by reference (or use std::transform writing back into position):
std::for_each(position.begin(), position.end(), [](int& v) {
    v = std::abs(v);
});
why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
It doesn't. Not in general anyway. It does so only on systems that use 1's complement representation for negative numbers, and the reason for that is simply because that's how the representation is specified. A negative number is represented by binary complement of the positive value.
Most commonly used representation these days is 2's complement which doesn't work that way.
Probably the first go-to source for bit twiddling hacks: The eponymous site
#include <climits>  // for CHAR_BIT

int v;              // we want to find the absolute value of v
unsigned int r;     // the result goes here
int const mask = v >> (sizeof(int) * CHAR_BIT - 1);
r = (v + mask) ^ mask;
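Applied to the question's position vector, here is a minimal sketch of my own (assuming two's complement and an arithmetic right shift for negative ints, as mainstream compilers provide; undefined for INT_MIN, just like the hack above):

#include <climits>
#include <vector>

void reset_signs(std::vector<int>& position) {
    for (int& v : position) {
        int const mask = v >> (sizeof(int) * CHAR_BIT - 1); // all ones if v < 0, else zero
        v = (v + mask) ^ mask;                              // branch-free |v|
    }
}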
However, I would question the assumption that position[i] = std::abs(position[i]) has worse performance. You should definitely have profiling results that demonstrate the bit hack being superior before you check in that kind of code.
Feel free to play around with a quick benchmark (with disassembly) of both - I don't think there is a difference in speed:
gcc 8.2
clang 6.0
Also take a look at the assembly that is actually generated:
https://godbolt.org/z/Ghcw_c
Evidently, clang sees your bithack and is not impressed - it generates a conditional move in all cases (which is branch-free). gcc does as you tell it, but has two more implementations of abs in store, some making use of register semantics of the target architecture.
And if you get into (auto-)vectorization, things get even more muddy. You'll have to profile regardless.
Conclusion: Just write std::abs - your compiler will do all the bit-twiddling for you.
To answer the extended question: Again, first write the obvious and intuitive code and then check that your compiler does the right thing: Look ma, no branches!
If you let it have its fun with auto-vectorization then you probably won't understand (or be good at judging) the assembly, so you have to profile anyway. Concrete example:
https://godbolt.org/z/oaaOwJ. clang likes to also unroll the auto-vectorized loop while gcc is more conservative. In any case, it's still branch-free.
Chances are that your compiler understands the nitty-gritty details of instruction scheduling on your target platform better than you. If you don't obscure your intent with bit-magic, it'll do a good job by itself. If it's still a hot spot in your code, you can then go and see if you can hand-craft a better version (but that will probably have to be in assembly).
Use functions to signify intent. Allow the compiler's optimiser to do a better job than you ever will.
#include <cmath>
void include_index(int& val)
{
val = std::abs(val);
}
void exclude_index(int& val)
{
val = -std::abs(val);
}
bool is_included(int const& val)
{
return val > 0;
}
Sample output from godbolt's gcc8 x86_64 compiler (note that it's all bit-twiddling and there are no conditional jumps - the bane of high performance computing):
include_index(int&):
mov eax, DWORD PTR [rdi]
sar eax, 31
xor DWORD PTR [rdi], eax
sub DWORD PTR [rdi], eax
ret
exclude_index(int&):
mov eax, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi]
sar eax, 31
xor edx, eax
sub eax, edx
mov DWORD PTR [rdi], eax
ret
is_included(int const&):
mov eax, DWORD PTR [rdi]
test eax, eax
setg al
ret
https://godbolt.org/z/ni6DOk
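For a quick usage sketch (my own, appending to the functions above and using the position vector from the question):

#include <vector>

void end_of_round_reset(std::vector<int>& position) {
    for (int& p : position)
        include_index(p);          // make every entry positive again
}

// exclude_index(position[0]);     // drop data point 1 for the next round
// if (is_included(position[4])) { /* use data point 5 */ }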
Also, because I'm new to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
This question deserves an answer all by itself since everyone will tell you that this is how you do it, but nobody ever tells you why.
Notice that 1 - 0 = 1 and 1 - 1 = 0. What this means is that if we do 1 - b, where b is a single bit, the result is the opposite of b, or not b (~b). Also notice how this subtraction never produces a borrow; this is very important, since b can be at most 1.
Also notice that subtracting a number with n bits simply means performing n 1-bit subtractions, while taking care of borrows. But our special case will never produce a borrow.
Basically we have created a mathematical definition for the bitwise not operation. To flip a bit b, do 1 - b. If we want to flip an n bit number, do this for every bit. But doing n subtractions in sequence is the same as subtracting two n bit numbers. So if we want to calculate the bitwise not of an 8-bit number, a, we simply do 11111111 - a, and the same for any n bit number. Once again this works because subtracting a bit from 1 will never produce a borrow.
But what is the sequence of n "1" bits? It's the value 2^n - 1. So taking the bitwise not of a number, a, is the same as calculating 2^n - 1 - a.
Now, numbers inside a computer are stored as numbers modulo 2^n. This is because we only have a limited number of bits available. You may know that if you work with 8 bits and you do 255 + 1 you'll get 0. This is because an 8-bit number is a number modulo 2^8 = 256, and 255 + 1 = 256. 256 is obviously equal to 0 modulo 256.
But why not do the same backwards? By this logic, 0 - 1 = 255, right? This is indeed correct. Mathematically, -1 and 255 are "congruent" modulo 256. Congruent essentially means equal to, but it's used to differentiate between regular equality and equality in modular arithmetic.
Notice in fact how 0 is also congruent to 256 modulo 256. So 0 - 1 = 256 - 1 = 255. 256 is our modulus, 2^8. But if bitwise not is defined as 2^n - 1 - a, then we have ~a = 2^8 - 1 - a. You'll notice how we have that - 1 in the middle. We can remove that by adding 1.
So we now have ~a + 1 = 2^n - 1 - a + 1 = 2^n - a. But 2^n - a is the negative of a modulo 2^n. And so here we have our negative number. This is called two's complement, and it's used in pretty much every modern processor, because it's the mathematical definition of a negative number in modular arithmetic modulo 2^n, and because numbers inside a processor behave as if they were modulo 2^n, the math just works out by itself. You can add and subtract without doing any extra steps. Multiplication and division do require "sign extension", but that's just a quirk of how those operations are defined; the meaning of the number doesn't change when extending the sign.
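As a quick sanity check, a small sketch of my own relying on the unsigned wrap-around described above:

#include <cassert>
#include <cstdint>

int main() {
    std::uint8_t a = 3;
    std::uint8_t not_a = ~a;                            // 2^8 - 1 - a = 252
    std::uint8_t neg_a = not_a + 1;                     // 2^8 - a = 253, i.e. -3 modulo 256
    assert(static_cast<std::uint8_t>(a + neg_a) == 0);  // 3 + (-3) wraps around to 0
}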
Of course with this method you lose a bit, because you now have half of the numbers being positive and the other half negative, but you can't just magically add a bit to your processor, so the new range of values you can represent is from -2^(n-1) to 2^(n-1) - 1 inclusive.
Alternatively, you can keep the number as it is and not add 1 at the end. This is known as the one's complement. Of course, this is not quite the same as the mathematical negative, so adding, subtracting, multiplying and dividing don't just work out of the box, you need extra steps to adjust the result. This is why two's complement is the de facto standard for signed arithmetic. There's also the problem that, in one's complement, both 0 and 2^n - 1 represent the same quantity, zero, while in two's complement, negative 0 is still correctly 0 (because ~0 + 1 = 2^n - 1 + 1 = 2^n = 0). I think one's complement is used in the Internet Protocol as a checksum, but other than that it has very limited purpose.
But be aware, "de facto" standard means that this is just what everyone does, but there is no rule that says that it MUST be done this way, so always make sure to check the documentation of you target architecture to make sure you're doing the right thing. Even though let's be honest, chances of finding a useful one's complement processor out there nowadays are pretty much null unless you're working on some extremely specific architecture, but still, better be safe than sorry.
"Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?"
Do you need to keep the numeric value of the entries you're changing from negative?
If not, you can use std::max to set the negative values to zero
iValue = std::max(iValue, 0); // returns zero if iValue is less than zero
If you need to keep the numeric value, but just change from negative to positive, then
iValue = std::abs(iValue); // always returns a positive value of iValue
Related
Please let me know in the comments if there are already some similar questions.
When we try to distinguish odd and even numbers, we usually write something like the following code in C++.
#include <iostream>

int main() {
    int n = 10;
    for (; n > 0; n--) {
        if (n % 2 == 0) std::cout << "even" << '\n';
        if (n % 2 == 1) std::cout << "odd" << '\n';
    }
}
I'm sure more than 99% of undergraduates, even professionals, would use a condition like "if (n%2==0)...else" to distinguish between odd and even.
However, when the range of numbers gets big enough, it seems to me that the "if (n%2==0)...else" method could be quite inefficient. Let me explain why.
#include <iostream>

int main() {
    int n = 100000;
    for (; n > 0; n--) {
        if (n % 2 == 0) std::cout << "even" << '\n';
        if (n % 2 == 1) std::cout << "odd" << '\n';
    }
}
when the integer "n" was small then dividing each of positive integer smaller than it wasn't a big deal.
However, when it becomes big, wouldn't there be some more efficient way than just dividing them?
We humans usually don't calculate modulo 2 to know whether "10^1000 + 1" is odd or even.
The same goes for "10^1000 + 2", "10^1000 + 3", and so on. We can just tell the answer by looking at the last digit of the integer.
I'm not an expert in CS, so I'm not sure about this, but I've heard machines are much friendlier to binary numbers than humans are. If so, wouldn't computers be able to distinguish between odd and even numbers faster just by looking at the last digit of their inputs, whether it is 0 or 1?
If there is some direct answer to this, I'm sure many intermediate-level numerical algorithms could benefit from it. Looking forward to someone's help.
Thanks.
Is the performance time for 50000000%2 and 5%2 really the same?
This is making the assumption that the compiler "looks at the number" and "knows" how big its value is. That's not the case for divisions. The compiler sees an int and performs some operations on that int. Different values are merely different bits set in that int, which always has the same number of bytes that need to be considered when carrying out a division.
On the other hand, %2 is such a common operation that the compiler indeed "knows where to look": The last bit.
Consider these two functions:
bool is_odd1(int n) { return (n%2);}
bool is_odd2(int n) { return (n&1);}
Both return true when n is odd.
is_odd1 uses %, which is defined as the remainder after division. Though just because it is defined like that does not imply that it must be implemented like that. The compiler will emit code that produces the result in accordance with the definition, and that code can be expected to be very efficient.
is_odd2 only considers the lowest bit of n to check if it is set or not.
With optimizations turned on, gcc produces the exact same output for both:
is_odd1(int):
mov eax, edi
and eax, 1
ret
is_odd2(int):
mov eax, edi
and eax, 1
ret
Both functions do nothing but check if the lowest bit of n is set.
Live Demo.
Conclusion: Do not worry about such micro optimizations. If possible they are implemented in the compiler. Rather write code for clarity and readability.
However, do not introduce inefficiencies on the large scale. For example if you had written your code like this, there would be no need to rely on compiler optimizations:
for(; n>0; n-=2){
std::cout<< "even" << '\n';
std::cout<< "odd" << '\n';
}
Though this is not a good example anyway, because printing something to the console is orders of magnitude more expensive than checking whether a single bit is set.
If you look at a number written in decimal, you can know quickly whether it is odd or even:
if it ends with 0, 2, 4, 6, or 8, then it is even;
if it ends with 1, 3, 5, 7, or 9, then it is odd.
Numbers on a computer are stored in binary. Binary is almost the same as decimal, except it uses base-2 instead of base-10, and thus uses only bits 0 or 1 instead of digits 0 to 9.
To check whether a number written in binary is odd or even, look at its last bit:
if it ends with 0, then it is even;
if it ends with 1, then it is odd.
When you write n % 2 == 0 in C++, your compiler optimises the code and doesn't perform a division at all. Instead, it simply looks at the last bit of n and checks whether it is a 0 or a 1.
Never test parity by testing the remainder (remainder, as well as division, is a very slow operation). It is enough to test the least significant bit by using bit masking: (n & 1) == 0 (note the parentheses; == binds tighter than &).
[By the way, smart optimizers will automatically substitute a % 2 test by masking. Another option is to use shifts: a == (a >> 1) << 1 detects an even number.]
If your number is a BigNum, it suffices to check the parity of the low order word in the representation.
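For instance, a minimal sketch (names of my own choosing) with a big number stored as little-endian 32-bit limbs:

#include <cstdint>
#include <vector>

// Parity depends only on the least significant bit of the lowest limb.
bool is_even(const std::vector<std::uint32_t>& limbs) {
    return limbs.empty() || (limbs[0] & 1u) == 0u;
}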
I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency of up to dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011..., which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x = 4/5. This is then represented as CCCCCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
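As a rough sketch of the same computation written out in C++ (my own illustration, assuming a compiler that provides unsigned __int128, such as GCC or Clang):

#include <cassert>
#include <cstdint>

std::uint64_t div5(std::uint64_t n) {
    const std::uint64_t magic = 0xCCCCCCCCCCCCCCCDull;     // ceil(2^66 / 5)
    std::uint64_t hi = (unsigned __int128)n * magic >> 64; // upper half of the 128-bit product
    return hi >> 2;                                        // total shift of 66 bits
}

int main() {
    assert(div5(9) == 1);
    assert(div5(1000000007) == 200000001);
    assert(div5(~0ull) == ~0ull / 5);
}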
In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64, so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5, so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7, and after dividing by four we end up with an error of just under 1.5/7, which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit), then doing a rotate through carry.
Here is a link to a paper describing an algorithm that produces the values and code I see with Visual Studio (in most cases), and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate an N+1 bit multiplier, as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence: a simpler 3 instruction sequence could overflow. Let u64() mean the upper 64 bits (all that is needed for the quotient):
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl ) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
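A rough C++ rendering of that five-instruction sequence for d = 7 (my own sketch, assuming a compiler with unsigned __int128; 0x2492492492492493 is the low 64 bits of ceil(2^67/7), and the post shift is 3):

#include <cassert>
#include <cstdint>

std::uint64_t div7(std::uint64_t n) {
    const std::uint64_t m_low = 0x2492492492492493ull;     // 65-bit multiplier minus 2^64
    std::uint64_t t = (unsigned __int128)n * m_low >> 64;  // mul: upper 64 bits of the product
    std::uint64_t q = ((n - t) >> 1) + t;                  // sub, shr 1, add: corrected upper half
    return q >> 2;                                         // shr cl, with cl = post_shift - 1 = 2
}

int main() {
    assert(div7(6) == 0);
    assert(div7(7) == 1);
    assert(div7(~0ull) == ~0ull / 7);
}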
I will answer from a slightly different angle: because the compiler is allowed to.
C and C++ are defined against an abstract machine. The compiler transforms the program, as described in terms of the abstract machine, into code for a concrete machine, following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even though a lot of C programmers assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
If there is any undefined behaviour in our code (signed integer overflow is a classic example), this contract is void.
This may have been asked already, but I was unable to find it on this forum. I have a general question about integer arithmetic in C++ when doing arithmetic on large integers.
unsigned int value = 500000;
value = (value * value) % 99;
The correct result of the above code would be 25; however, when this is implemented in C++ I get a value of 90.
I looked at the disassembly, and that gave me a slight idea as to why it may be coming back with an erroneous value; the disassembly is as follows:
unsigned int value = 500000;
00D760B5 mov dword ptr [value],7A120h
value = ((value * value) % 99);
00D760BC mov eax,dword ptr [value]
00D760BF imul eax,dword ptr [value]
00D760C3 xor edx,edx
00D760C5 mov ecx,63h
00D760CA div eax,ecx
00D760CC mov dword ptr [value],edx
It appears to be placing the product into a 32-bit register, and that is why the result comes back erroneous.
Nevertheless, my question is how can I mimic the language to give me the correct answer? I'm sure there's an easy and straightforward way to do it but I'm drawing a blank.
You should consider using a larger integer type, such as unsigned long or unsigned long long. On my machine both of these give you eight bytes of working space, which allows for a maximum value of 18,446,744,073,709,551,615. This is comfortably large enough to fit 500000^2=250,000,000,000.
But, in my opinion, that's a poor solution. You should, instead, use better mathematics. The following modular identity should let you do what you want:
(ab) mod n = [(a mod n)(b mod n)] mod n.
In your case, you'd write:
unsigned int value = 500000;
value = ((value%99)*(value%99))%99;
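For completeness, a minimal self-contained sketch showing both approaches (widening the type versus reducing before multiplying):

#include <cstdint>
#include <iostream>

int main() {
    // Option 1: widen before multiplying, so the product fits.
    std::uint64_t wide = 500000;
    std::cout << (wide * wide) % 99 << '\n';                 // prints 25

    // Option 2: reduce first, staying within 32 bits.
    unsigned int value = 500000;
    std::cout << ((value % 99) * (value % 99)) % 99 << '\n'; // prints 25
}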
The numbers you are working with are really big. Some other languages, like Python, have built-in libraries that handle really big numbers in special ways, because a single 32-bit integer can't hold that much. C++ does not do that, and as ForceBru put it, the result is gigantic and is giving you overflow errors.
If you want to deal with long numbers you could manipulate polynomials in coefficient representation for big integers, or look for a scientific computing library online.
I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second, and I figured that removing it as much as possible would significantly increase the speed of my code; as of now one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering if it is possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. So here is my code so far in C++ (if I can perform the modulo in assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
out = (out << 3) + (out << 1);
out += input % 10;
input = (input >> 8) + (input >> 1);
}
EDIT: Actually I realized that I need to do it way more than 180 times per second, seeing as the value of input can be a very large number, up to 40 digits.
What you can do with simple bitwise operations is take the modulo of a value (the dividend) by a power-of-two divisor by AND'ing it with divisor - 1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why does it work? Let's consider the bit pattern for the value 123, which is 1111011, and then the divisor 4, which has the bit pattern 00000100. As we know by now, the divisor has to be a power of two (as 4 is) and we need to decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND both the original 123 and 3, the resulting bit pattern will be 00000011. That turns out to be 3 in decimal. The reason we need a power-of-two divisor is that once we decrement it by one, we get all the less significant bits set to 1 and the rest set to 0. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value, and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something like this for arbitrary divisors is not going to work unless you know your divisors beforehand (at compile time, and even then it requires divisor-specific code paths) - resolving it at run time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually, division by constants is a well-known optimization for compilers, and in fact gcc is already doing it.
This simple code snippet:
int mod(int val) {
return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
push ebp
mov edx, 1717986919
mov ebp, esp
mov ecx, DWORD PTR [ebp+8]
pop ebp
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 31
sar edx, 2
sub edx, eax
lea eax, [edx+edx*4]
mov edx, ecx
add eax, eax
sub edx, eax
mov eax, edx
ret
If you disregard the function epilogue/prologue, it's basically two muls (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously somewhat more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on an efficient mulhs (i.e. the higher bits of a multiply instruction) to be efficient.
Note that this code is not size independent - to be exact the magic number changes (and maybe also parts of the add/shifts), but that can be adapted.
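For illustration, here is roughly what that optimization computes, written out by hand for unsigned 32-bit values (my own sketch using a 64-bit intermediate; gcc's signed version above additionally handles the sign):

#include <cassert>
#include <cstdint>

unsigned mod10(std::uint32_t n) {
    const std::uint64_t magic = 0xCCCCCCCDull;        // ceil(2^35 / 10)
    std::uint32_t q = (std::uint64_t)n * magic >> 35; // q = n / 10
    return n - q * 10;                                // the remainder
}

int main() {
    assert(mod10(123456789u) == 9);
    assert(mod10(0xFFFFFFFFu) == 0xFFFFFFFFu % 10u);
}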
Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.
But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes
out += tab[out]
and with any luck at all, that will turn out to be one 16-bit add and a store operation.
If you want to do modulo 10 with shifts, maybe you can adapt the double dabble algorithm to your needs?
This algorithm is used to convert binary numbers to decimal without using modulo or division.
Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.
0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10
Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.
0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
+ 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
= 28 % 10
This reduces the 16-bit, 4-nybble modulo to a number at most 0xF * 4 + 5 = 65. In binary, that is annoyingly still 3 nybbles so you would need to repeat the algorithm (although one of them doesn't really count).
But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)
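As a concrete sketch of that reduction for a 16-bit value in C++ (my own illustration; the final small reduction is written with a plain %, which on the target could be a tiny lookup table, a repeat of the same step, or the BCD addition suggested above):

#include <cassert>
#include <cstdint>

unsigned mod10_nybbles(std::uint16_t x) {
    // Each of the three high nybbles contributes (nybble * 6) mod 10;
    // the 6s collapse to "add 5 if the nybble sum is odd".
    unsigned sum = ((x >> 4) & 0xF) + ((x >> 8) & 0xF) + (x >> 12);
    unsigned reduced = sum + ((sum & 1) ? 5 : 0) + (x & 0xF); // at most 65
    return reduced % 10;
}

int main() {
    assert(mod10_nybbles(0x481A) == 8);  // the example above
    for (unsigned v = 0; v < 0x10000; ++v)
        assert(mod10_nybbles(static_cast<std::uint16_t>(v)) == v % 10);
}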