Multi-Statement Arithmetic in c++ - c++

This may have been asked already, but I was unable to find it on this forum. I had a general question about integer arithmetic in c++ when doing arithmetic on large integers.
unsigned int value = 500000;
value = (value * value) % 99;
The correct value to the above code would be 25, however when this is implemented in C++ I get a value of 90.
I looked at the disassembly and that gave me a slight idea as to why it may be coming back with an erroneous value, the disassembly is as follows
unsigned int value = 500000;
00D760B5 mov dword ptr [value],7A120h
value = ((value * value) % 99);
00D760BC mov eax,dword ptr [value]
00D760BF imul eax,dword ptr [value]
00D760C3 xor edx,edx
00D760C5 mov ecx,63h
00D760CA div eax,ecx
00D760CC mov dword ptr [value],edx
It appears to be placing it into a 32 bit register, and that is why the result is coming back erroneous.
Nevertheless, my question is how can I mimic the language to give me the correct answer? I'm sure there's an easy and straightforward way to do it but I'm drawing a blank.

You should consider using a larger integer type, such as unsigned long or unsigned long long. On my machine both of these give you eight bytes of working space, which allows for a maximum value of 18,446,744,073,709,551,615. This is comfortably large enough to fit 500000^2=250,000,000,000.
But, in my opinion, that's a poor solution. You should, instead, use better mathematics. The follow modular identity should let you do what you want:
(ab) mod n = [(a mod n)(b mod n)] mod n.
In you case, you'd write:
unsigned int value = 500000;
value = ((value%99)*(value%99))%99;

The numbers you are working with that are really big. It some other languages like python they have built in libraries that handle really big numbers in special ways because a single 32-bit integer cant hold that much. C++ does not do that, and as ForceBru put, the result is gigantic and is giving your overflow errors.
If you want to deal with long numbers you could manipulate polynomials in coefficient representation for big integers, or look for a scientific computing library online.

Related

How does a C/C++ compiler optimize division by non-powers-of-two? [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2-64. The value of i is less than 264 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2−2.
(1/5) * 2−2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2-64. After multiplying by an i value of just under 264 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement divison by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.
Here is link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You're already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl ) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms this program in terms of the abstract machine to concrete machine following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even when a lot of C programmer assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
Any undefined behaviour in our code (signed integer overflow is a classical example) and this contract is void.

Assembler division x86 strange numbers [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the binary representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011... which by trivial inspection can be added the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x=4/5. This is then represented as CCCCCCCCCCCCD in hex for rounding (as the binary digit beyond the last one present would be a 1).
In general multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead we can significantly speed up division by a constant
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2-64. The value of i is less than 264 so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2−2.
(1/5) * 2−2 < 1/5 so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero we end up with an error of (6/7) * 2-64. After multiplying by an i value of just under 264 we end up with an error just under 6/7 and after dividing by four we end up with an error of just under 1.5/7 which is greater than 1/7.
So to implement divison by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit) then doing a rotate through carry.
Here is link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You're already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( ((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl ) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms this program in terms of the abstract machine to concrete machine following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even when a lot of C programmer assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
Any undefined behaviour in our code (signed integer overflow is a classical example) and this contract is void.

Bit fiddling: Negate an int, only if it's negative

I'm working through a codebase that uses the following encoding to indicate sampling with replacement: We maintain an array of ints as indicators of position and presence in the sample, where positive ints indicate a position in another array and negative ints indicate that we should not use the data point in this iteration.
Example:
data_points: [...] // Objects vector of length 5 == position.size()
std::vector<int> position: [3, 4, -3, 1, -2]
would mean that the first element in data_points should go to bucket 3, the second to bucket 4, and the fourth to bucket 1.
The negative values indicate that for this iteration we won't be using those data points, namely the 3rd and 5th data points are marked as excluded because their values in position are negative and have been set with position[i] = ~position[i].
The trick is that we might perform this multiple times, but the positions of the data points in the index should not change. So in the next iteration, if we want to exclude data point 1, and include data point 5 we could do something like:
position[0] = ~position[0] // Bit-wise complement flips the sign on ints and subtracts 1
position[4] = ~position[4]
This will change the position vector into
std::vector<int> position: [-4, 4, -3, 1, 1]
Now on to the question: At the end of each round I want to reset all signs to positive, i.e. position should become [3, 4, 3, 1, 2].
Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?
Also, because I'm new to to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
Edit: The above is wrong, the complement of a (int) will give -(a + 1) and depends on the representation of ints as noted in the answers. So the original question of simply taking the positive value of an existing one does not apply, we actually need to perform a bitwise complement to get the original value.
Example:
position[0] = 1
position[0] = ~position[0] // Now position[0] is -2
// So if we did
position[0] = std::abs(position[0]) // This is wrong, position[0] is now 2!
// If we want to reset it to 1, we need to do another complement
position[0] = ~position[0] // Now position[0] is again 1
I recommend not attempting bit fiddling. Partly because you're dealing with signed numbers, and you would throw away portability if you fiddle. Partly because bit fiddling is not as readable as reusable functions.
A simple solution:
std::for_each(position.begin(), position.end(), [](int v) {
return std::abs(v);
});
why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
It doesn't. Not in general anyway. It does so only on systems that use 1's complement representation for negative numbers, and the reason for that is simply because that's how the representation is specified. A negative number is represented by binary complement of the positive value.
Most commonly used representation these days is 2's complement which doesn't work that way.
Probably the first go-to source for bit twiddling hacks: The eponymous site
int v; // we want to find the absolute value of v
unsigned int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v + mask) ^ mask;
However, I would question the assumption that position[i] = std::abs(position[i]) has worse performance. You should definitely have profiling results that demonstrate the bit hack being superior before you check in that kind of code.
Feel free to play around with a quick benchmark (with disassembly) of both - I don't think there is a difference in speed:
gcc 8.2
clang 6.0
Also take a look at the assembly that is actually generated:
https://godbolt.org/z/Ghcw_c
Evidently, clang sees your bithack and is not impressed - it generates a conditional move in all cases (which is branch-free). gcc does as you tell it, but has two more implementations of abs in store, some making use of register semantics of the target architecture.
And if you get into (auto-)vectorization, things get even more muddy. You'll have to profile regardless.
Conclusion: Just write std::abs - your compiler will do all the bit-twiddling for you.
To answer the extended question: Again, first write the obvious and intuitive code and then check that your compiler does the right thing: Look ma, no branches!
If you let it have its fun with auto-vectorization then you probably won't understand (or be good at judging) the assembly, so you have to profile anyway. Concrete example:
https://godbolt.org/z/oaaOwJ. clang likes to also unroll the auto-vectorized loop while gcc is more conservative. In any case, it's still branch-free.
Chances are that your compiler understands the nitty-gritty details of instruction scheduling on your target platform better than you. If you don't obscure your intent with bit-magic, it'll do a good job by itself. If it's still a hot spot in your code, you can then go and see if you can hand-craft a better version (but that will probably have to be in assembly).
Use functions to signify intent. Allow the compiler's optimiser to do a better job than you ever will.
#include <cmath>
void include_index(int& val)
{
val = std::abs(val);
}
void exclude_index(int& val)
{
val = -std::abs(val);
}
bool is_included(int const& val)
{
return val > 0;
}
Sample output from godbolt's gcc8 x86_64 compiler (note that it's all bit-twiddling and there are no conditional jumps - the bane of high performance computing):
include_index(int&):
mov eax, DWORD PTR [rdi]
sar eax, 31
xor DWORD PTR [rdi], eax
sub DWORD PTR [rdi], eax
ret
exclude_index(int&):
mov eax, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi]
sar eax, 31
xor edx, eax
sub eax, edx
mov DWORD PTR [rdi], eax
ret
is_included(int const&):
mov eax, DWORD PTR [rdi]
test eax, eax
setg al
ret
https://godbolt.org/z/ni6DOk
Also, because I'm new to to bit fiddling like this, why/how does taking the bit complement of a signed, positive int give us its mathematical complement? (i.e. the same value with the sign flipped)
This question deserves an answer all by itself since everyone will tell you that this is how you do it, but nobody ever tells you why.
Notice that 1 - 0 = 1 and 1 - 1 = 0. What this means is that if we do 1 - b, where b is a single bit, the result is the opposite of b, or not b (~b). Also notice how this subtraction will never produce a borrow, this is very important, since b can only be at most 1.
Also notice that subtracting a number with n bits simply means performing n 1-bit subtractions, while taking care of borrows. But our special case will never procude a borrow.
Basically we have created a mathematical definiton for the bitwise not operation. To flip a bit b, do 1 - b. If we wanna flip an n bit number, do this for every bit. But doing n subtractions in sequence is the same as subtracting two n bit numbers. So if we wanna calculate the bitwise not of an 8-bit number, a, we simply do 11111111 - a, and the same for any n bit number. Once again this works because subtracting a bit from 1 will never produce a borrow.
But what is the sequence of n "1" bits? It's the value 2^n - 1. So taking the bitwise not of a number, a, is the same as calculating 2^n - 1 - a.
Now, numbers inside a computer are stored as numbers modulo 2^n. This is because we only have a limited number of bits available. You may know that if you work with 8 bits and you do 255 + 1 you'll get 0. This is because an 8-bit number is a number modulo 2^8 = 256, and 255 + 1 = 256. 256 is obviously equal to 0 modulo 256.
But why not do the same backwards? By this logic, 0 - 1 = 255, right? This is indeed correct. Mathematically, -1 and 255 are "congruent" modulo 256. Congruent essentially means equal to, but it's used to differentiate between regular equality and equality in modular arithmetic.
Notice in fact how 0 is also congruent to 256 modulo 256. So 0 - 1 = 256 - 1 = 255. 256 is our modulus, 2^8. But if bitwise not is defined as 2^n - 1 - a, then we have ~a = 2^8 - 1 - a. You'll notice how we have that - 1 in the middle. We can remove that by adding 1.
So we now have ~a + 1 = 2^n - 1 - a + 1 = 2^n - a. But 2^n - a is the negative a modulo n. And so here we have our negative number. This is called two's complement, and it's used in pretty much every modern processor, because it's the mathematical definition of a negative number in modular arithmetic modulo 2^n, and because numbers inside a processor work as if they were in modulo 2^n the math just works out by itself. You can add and subtract without doing any extra steps. Multiplication and division do require "sign extension", but that's just a quirk of how those operations are defined, the meaning of the number doesn't change when extending the sign.
Of course with this method you lose a bit, because you now have half of the numbers being positive and the other half negative, but you can't just magically add a bit to your processor, so the new range of value you can represent is from -2^(n-1) to 2^(n-1) - 1 inclusive.
Alternatively, you can keep the number as it is and not add 1 at the end. This is known as the one's complement. Of course, this is not quite the same as the mathematical negative, so adding, subtracting, multiplying and dividing don't just work out of the box, you need extra steps to adjust the result. This is why two's complement is the de facto standard for signed arithmetic. There's also the problem that, in one's complement, both 0 and 2^n - 1 represent the same quantity, zero, while in two's complement, negative 0 is still correctly 0 (because ~0 + 1 = 2^n - 1 + 1 = 2^n = 0). I think one's complement is used in the Internet Protocol as a checksum, but other than that it has very limited purpose.
But be aware, "de facto" standard means that this is just what everyone does, but there is no rule that says that it MUST be done this way, so always make sure to check the documentation of you target architecture to make sure you're doing the right thing. Even though let's be honest, chances of finding a useful one's complement processor out there nowadays are pretty much null unless you're working on some extremely specific architecture, but still, better be safe than sorry.
"Is there a bit-fiddling trick that will allow me to do this without having an if condition for the sign of the value?"
Do you need to keep the numeric value of the number being changed from a negative value?
If not, you can use std::max to set the negative values to zero
iValue = std::max(iValue, 0); // returns zero if iValue is less than zero
If you need to keep the numeric value, but just change from negative to positive, then
iValue = std::abs(iValue); // always returns a positive value of iValue

Saturate short (int16) in C++

I am optimizing bottleneck code:
int sum = ........
sum = (sum >> _bitShift);
if (sum > 32000)
sum = 32000; //if we get an overflow, saturate output
else if (sum < -32000)
sum = -32000; //if we get an underflow, saturate output
short result = static_cast<short>(sum);
I would like to write the saturation condition as one "if condition" or even better with no "if condition" to make this code faster. I don't need saturation exactly at value 32000, any similar value like 32768 is acceptable.
According this page, there is a saturation instruction in ARM. Is there anything similar in x86/x64?
I'm not at all convinced that attempting to eliminate the if statement(s) is likely to do any real good. A quick check indicates that given this code:
int clamp(int x) {
if (x < -32768)
x = -32768;
else if (x > 32767)
x = 32767;
return x;
}
...both gcc and Clang produce branch-free results like this:
clamp(int):
cmp edi, 32767
mov eax, 32767
cmovg edi, eax
mov eax, -32768
cmp edi, -32768
cmovge eax, edi
ret
You can do something like x = std::min(std::max(x, -32768), 32767);, but this produces the same sequence, and the source seems less readable, at least to me.
You can do considerably better than this if you use Intel's vector instructions, but probably only if you're willing to put a fair amount of work into it--in particular, you'll probably need to operate on an entire (small) vector of values simultaneously to accomplish much this way. If you do go that way, you usually want to take a somewhat different approach to the task than you seem to be taking right now. Right now, you're apparently depending on int being a 32-bit type, so you're doing the arithmetic on a 32-bit type, then afterwards truncating it back down to a (saturated) 16-bit value.
With something like AVX, you'd typically want to use an instruction like _mm256_adds_epi16 to take a vector of 16 values (of 16-bits apiece), and do a saturating addition on all of them at once (or, likewise, _mm256_subs_epi16 to do saturating subtraction).
Since you're writing C++, what I've given above are the names of the compiler intrinsics used in most current compilers (gcc, icc, clang, msvc) for x86 processors. If you're writing assembly language directly, the instructions would be vpaddsw and vpsubsw respectively.
If you can count on a really current processor (one that supports AVX 512 instructions) you can use them instead to operate on a vector of 32 16-bit values simultaneously.
Are you sure you can beat the compiler at this?
Here's x64 retail with max size optimizations enabled. Visual Studio v15.7.5.
ecx contains the intial value at the start of this block. eax is filled with the saturated value when it is done.
return (x > 32767) ? 32767 : ((x < -32768) ? -32768 : x);
mov edx,0FFFF8000h
movzx eax,cx
cmp ecx,edx
cmovl eax,edx
mov edx,7FFFh
cmp ecx,edx
movzx eax,ax
cmovg eax,edx

re implement modulo using bit shifts?

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second and I figured that removing it as much as possible would significantly increase the speed of my code, as of now one cycle of my mainloop does not run in 1/60 of a second as it should. I was wondering if it was possible to re-implement the modulo using only bit shifts like is possible with multiplication and division. So here is my code so far in c++ (if i can perform a modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
out = (out << 3) + (out << 1);
out += input % 10;
input = (input >> 8) + (input >> 1);
}
EDIT: Actually I realized that I need to do it way more than 180 times per second. Seeing as the value of input can be a very large number up to 40 digits.
What you can do with simple bitwise operations is taking a power-of-two modulo(divisor) of the value(dividend) by AND'ing it with divisor-1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why it works? Let's consider a bit pattern for the value 123 which is 1111011 and then the divisor 4, which has the bit pattern of 00000100. As we know by now, the divisor has to be power-of-two(as 4 is) and we need to decrement it by one(from 4 to 3 in decimal) which yields us the bit pattern 00000011. After we bitwise-AND both the original 123 and 3, the resulting bit pattern will be 00000011. That turns out to be 3 in decimal. The reason why we need a power-of-two divisor is that once we decrement them by one, we get all the less significant bits set to 1 and the rest are 0. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value, and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something specific like this for arbitrary divisors is not going to work unless you know your divisors beforehand(at compile time, and even then requires divisor-specific codepaths) - resolving it run-time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually division by constants is a well known optimization for compilers and in fact, gcc is already doing it.
This simple code snippet:
int mod(int val) {
return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
push ebp
mov edx, 1717986919
mov ebp, esp
mov ecx, DWORD PTR [ebp+8]
pop ebp
mov eax, ecx
imul edx
mov eax, ecx
sar eax, 31
sar edx, 2
sub edx, eax
lea eax, [edx+edx*4]
mov edx, ecx
add eax, eax
sub edx, eax
mov eax, edx
ret
If you disregard the function epilogue/prologue, basically two muls (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously a bit more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on an efficient mulhs (ie higher bits of a multiply instruction) to be efficient.
Note that this code is not size independent - to be exact the magic number changes (and maybe also parts of the add/shifts), but that can be adapted.
Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.
But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes
out += tab[out]
and with any luck at all, that will turn out to be one 16-bit add and a store operation.
If you want to do modulo 10 and shifts, maybe you can adapt double dabble algorithm to your needs?
This algorithm is used to convert binary numbers to decimal without using modulo or division.
Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.
0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10
Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.
0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
+ 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
= 28 % 10
This reduces the 16-bit, 4-nybble modulo to a number at most 0xF * 4 + 5 = 65. In binary, that is annoyingly still 3 nybbles so you would need to repeat the algorithm (although one of them doesn't really count).
But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)