Why is the runtime of a bitwise operation constant? - bit-manipulation

Let's say we are performing an XOR operation over the integers 1 and 4 to find the Hamming distance. Why is the runtime of XOR and the other bitwise operations below constant? Is it because the size of an int is fixed in languages like Python etc., so the operation takes constant time regardless of the integer inputs?
[Edit]
Let's say we are calculating the Hamming distance of two integers using Brian Kernighan's algorithm, as below:
def hammingDistance(x: int, y: int):
    xor = x ^ y
    distance = 0
    while xor:
        distance += 1
        # remove the rightmost bit of '1'
        xor = xor & (xor - 1)
    return distance

That function is not constant-time (the loop runs once for every set bit in xor), but each bitwise operator in it is constant-time on most (or all?) modern CPUs, which have instructions for those that operate on entire 16/32/64-bit operands in a fixed number of cycles.
Bitwise instructions are some of the simplest CPU instructions; I'd be curious to know if any processor ever had variable-time bit-manipulation instructions. Very few arithmetic CPU instructions have runtimes that depend on their operand values. Some examples are division (in some/most? cases), floating-point multiplication when denormals are involved, and pdep on AMD CPUs before Zen 3.

Related

Replace right shift by multiplication

I know that it is possible to use the left shift to implement multiplication by the power of two (x << 4 = x * 16).
Also, it is trivial to replace the right shift by division by a power of two (x >> 5 = x / 32).
I am wondering: is it possible to replace the right shift with multiplication?
It seems not to be possible in the general case, but my question is limited to modulo 2^32 and 2^64 arithmetic (unsigned 32-bit and 64-bit values). Also, maybe it can be done if we can add other cheap instructions like + and - in addition to * to emulate the right bit shift?
I assume an exotic architecture where the right shift is more expensive than other arithmetic (similar to division).
uint64_t foo(uint64_t x) {
    return x >> 3; // how to avoid using right shift here?
}
There is a similar question How to perform right shifting binary multiplication? that asks how to replace multiplication of two unsigned numbers by right shift. Basically, it uses a loop internally. However, maybe if the second number is a constant, this loop can be avoided (or at least unrolled to a shorter fragment)?
"Multiply-high" aka high-mul, hmul, mulh, etc, can be used to emulate a shift-right with a constant count. Usually that's not a good trade. It's also hardly related to C++.
Normal multiplication (putting floating point stuff aside) cannot be used to implement a shift-right.
my question is limited to modulo 2^32 and 2^64 arithmetic
It doesn't help. You can use that property to "unmultiply" (sort of like dividing, except not really) by odd numbers; for example, if b = 5 * a then a = b * 0xCCCCCCCD, using the modular multiplicative inverse. The number being inverted must be relatively prime to the modulus. Since the modulus is a power of two, the "divisor" here cannot be a power of two (except 1, which does nothing), so a shift-right cannot be done this way.
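A quick self-contained check of that claim (just a sketch; the constant 0xCCCCCCCD is the inverse of 5 modulo 2^32, as stated above):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint32_t a = 123456789u;
        uint32_t b = 5u * a;                  // unsigned arithmetic wraps modulo 2^32
        assert(5u * 0xCCCCCCCDu == 1u);       // 0xCCCCCCCD is the inverse of 5 (mod 2^32)
        assert(b * 0xCCCCCCCDu == a);         // "unmultiplying" recovers a
        return 0;
    }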
Another way to look at it (probably simpler) is that what a multiplication does is conditionally add together a bunch of left-shifted versions of the multiplicand. Only left-shifted versions, not right-shifted ones. Which of those shifted versions are selected by the multiplier doesn't matter; there are no right-shifted versions to select.

Is it faster to multiply low numbers in C/C++ (as opposed to high numbers)?

Example of question:
Is calculating 123 * 456 faster than calculating 123456 * 7890? Or is it the same speed?
I'm wondering about 32 bit unsigned integers, but I won't ignore answers about other types (64 bit, signed, float, etc.). If it is different, what is the difference due to? Whether or not the bits are 0/1?
Edit: If it makes a difference, I should clarify that I'm referring to any number (two random numbers lower than 100 vs two random numbers higher than 1000)
For built-in types up to at least the architecture's word size (e.g. 64 bit on a modern PC, 32 or 16 bit on most low-cost general-purpose CPUs from the last couple of decades), for every compiler/implementation/version and CPU I've ever heard of, the CPU opcode for multiplication of a particular integral size takes a fixed number of clock cycles irrespective of the quantities involved. Multiplications of data with different sizes perform differently on some CPUs (e.g. the AMD K7 has 3-cycle latency for 16-bit IMUL vs 4 cycles for 32-bit).
It is possible that on some architecture and compiler/flags combination, a type like long long int has more bits than the CPU opcodes can operate on in one instruction, so the compiler may emit code to do the multiplication in stages and that will be slower than multiplication of CPU-supported types. But again, a small value stored at run-time in a wider type is unlikely to be treated - or perform - any differently than a larger value.
All that said, if one or both values are compile-time constants, the compiler can avoid the CPU multiplication instruction and optimise to addition or bit-shifting operators for certain values (e.g. multiplying by 1 is obviously a no-op, a 0 on either side gives a 0 result, and * 4 can sometimes be implemented as << 2). There's nothing in particular stopping techniques like bit shifting from being used for larger numbers, but a smaller percentage of such numbers can be optimised to the same degree (e.g. there are more powers of two - for which multiplication can be performed using a left shift - between 0 and 1000 than between 1000 and 2000).
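As a rough illustration of that kind of strength reduction (a sketch only; the exact output depends on the compiler, flags and target):

    #include <stdint.h>

    // Typical compile-time rewrites of multiplication by a constant:
    uint32_t times16(uint32_t x) { return x * 16; }  // usually becomes x << 4
    uint32_t times10(uint32_t x) { return x * 10; }  // e.g. shifts and adds such as (x << 3) + (x << 1)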
This is highly dependent on the processor architecture and model.
In the old days (ca 1980-1990), the number of ones in the two numbers would be a factor - the more ones, the longer it took to multiply [after sign adjustment, so multiplying by -1 wasn't slower than multiplying by 1, but multiplying by 32767 (15 ones) was notably slower than multiplying by 17 (2 ones)]. That's because a multiply is essentially:
unsigned int multiply(unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    for (int i = 0; i < 32; i++)   // once per bit of b (32 for a 32-bit unsigned int)
    {
        if (b & 1)
        {
            res += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return res;
}
In modern processors, multiply is quite fast either way, but a 64-bit multiply can be a clock cycle or two slower than a 32-bit one. Simply because modern processors can "afford" to put down the whole logic for doing this in a single cycle - both in terms of the speed of the transistors themselves and the area that those transistors take up.
Further, in the old days, there were often instructions to do 16 x 16 -> 32 bit results, but if you wanted 32 x 32 -> 32 (or 64), the compiler would have to call a library function [or inline such a function]. Today, I'm not aware of any modern high-end processor [x86, ARM, PowerPC] that can't do at least 64 x 64 -> 64, and some do 64 x 64 -> 128, all in a single instruction (not always a single cycle tho').
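For example, on x86-64 and AArch64 a full 64 x 64 -> 128 product is a single widening multiply; with GCC or Clang you can reach it through the (non-standard) unsigned __int128 type - a sketch:

    #include <stdint.h>

    // Full 64 x 64 -> 128 bit product, split into high and low halves.
    void mul_64x64_to_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *lo = (uint64_t)p;          // low 64 bits
        *hi = (uint64_t)(p >> 64);  // high 64 bits
    }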
Note that I'm completely ignoring the fact that "if the data is in cache is an important factor". Yes, that is a factor - and it's a bit like ignoring wind resistance when traveling at 200 km/h - it's not at all something you ignore in the real world. However, it is quite unimportant for THIS discussion. Just like people making sports cars care about aerodynamics, to get complex [or simple] software to run fast involves a certain amount of caring about the cache-content.
For all intents and purposes, the same speed (even if there were differences in computation speed, they would be immeasurable). Here is a reference benchmarking different CPU operations if you're curious: http://www.agner.org/optimize/instruction_tables.pdf.

Who defines the sign inversion pattern for integers?

Two's complement means that simply inverting all bits of a number i gets me -i-1:
~0 is -1
~01000001 is 10111110
~65 is -66
etc. To switch the sign of an integer, I have to use the actual minus sign.
int i = 65; int j = -i;
cout << j; // -65
Where is that actual behavior defined, and whose responsibility is it to ensure the two's complement pattern (to make a number negative, invert all bits and add 1) is followed? I don't even know if this is a hardware or compiler operation.
It's usually done by the CPU hardware.
Some CPUs have an instruction for calculating the negative of a number. In the x86 architecture, it's the NEG instruction.
If not, it can be done using the multiplication operator, multiplying the number by -1. But many programmers take advantage of the identity that you discovered, and complement the number and then add 1. See
How to convert a positive number to negative in assembly
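A minimal sketch of that identity in C (the function name is made up; doing the complement on an unsigned type sidesteps signed-overflow concerns):

    #include <assert.h>
    #include <stdint.h>

    // Two's complement negation: invert all bits, then add 1 (equal to 0 - x modulo 2^32).
    uint32_t negate(uint32_t x) {
        return ~x + 1u;
    }

    int main(void) {
        assert(negate(65u) == (uint32_t)-65);   // 65 -> the bit pattern of -65
        assert(negate(0u) == 0u);               // 0 stays 0
        return 0;
    }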
The reasons for this are simple: consistency with 0 and with addition.
You want addition to work the same for positive and negative numbers without special cases... in particular, incrementing -1 by 1 must yield 0.
The only bit sequence where the classic overflowing increment produces the value 0 is the all-1s sequence: if you increment it by 1, you get all zeros. So that is your -1: all 1s, i.e. the bitwise negation of 0.
Now we have (assuming 8 bit integers, incrementing by 1 each line)
-2: 11111110 = ~1
-1: 11111111 = ~0
0: 00000000 = ~-1
+1: 00000001 = ~-2
If you don't like this behavior, you need to handle special cases in addition, and you'll have +0 and -0. Most likely, such a CPU would be a lot slower.
If your question is how
int i = -j;
is implemented, that depends on your compiler and CPU and optimization. Usually it will be optimized together with other operations you specify. But don't be surprised if this ends up being performed as
int i = 0 - j;
Since this probably takes 1-2 CPU ticks to compute (e.g. as one XOR of a register with itself to get 0, then a SUB operation to compute 0-j), it will barely ever be a bottleneck. Loading j and storing the result i somewhere in memory will be far more expensive. In fact, some CPUs (MIPS?) even have a built-in register that is always zero. Then you don't need a special instruction for negation, you simply subtract j from $zero, usually in 1 tick.
Current Intel CPUs are said to recognize such xor operations and do them in 0 ticks, with register-renaming optimizations (i.e. they let the next instruction use a new register that is zero). You have neg on amd64, but a fast xor rax,rax is useful in other situations, too.
C arithmetic is defined in terms of values. When the code is:
int i = 65;
int j = -i;
the compiler will emit whatever CPU instructions are required to give j the value of -65, regardless of the bit representation.
Historically, not all systems used 2's complement. The C compiler would choose a system of negative numbers that leads to the most efficient output for the target CPU, based on the CPU's capabilities.
However, 2's complement is a very common choice because it leads to the simplest algorithms for doing arithmetic. For example, the same instructions can be used for +, - and * with signed and unsigned integers.
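A small worked example of that property (a sketch, using 8-bit values for brevity): the single addition 0xFE + 0x03 = 0x01 (mod 256) is correct under both readings, 254 + 3 = 257 = 1 (mod 256) unsigned, and -2 + 3 = 1 signed. One adder, no special cases.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t u = (uint8_t)(0xFEu + 0x03u);    // wraps modulo 256
        int8_t  s = (int8_t)u;                   // reinterpret the same bit pattern as signed
        printf("%u %d\n", (unsigned)u, (int)s);  // prints: 1 1
        return 0;
    }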

Is casting a signed integer to a binary floating point number cheaper than the inverse operation?

I know from articles like "Why you should never cast floats to ints" and many others like it that casting a float to a signed int is expensive. I'm also aware that certain conversion instructions or SIMD vector instructions on some architectures can speed the process. I'm curious if converting an integer to floating point is also expensive, as all the material I've found on the subject only talks about how expensive it is to convert from floating point to integer.
Before anyone says "Why don't you just test it?" I'm not talking about performance on a particular architecture, I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the algorithm for conversion that affects performance in general?
Intuitively, I would think that conversion from integer to floating point would be easier in general for the following reasons:
Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating point number, e.g. 32-bit integer to 32-bit float might require rounding, but 32-bit integer to 64-bit float won't, and neither will a 32-bit integer that only uses 24-bits of precision.
There is no need to check for NAN or +/- INF or +/- 0.
There is no danger of overflow or underflow.
What are reasons that conversion from int to float could result in poor cross-platform performance, if any (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?
Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles latency (and 1 cycle throughput) on the basic desktop/server line since Core2. This can be accepted as a good example.
From the hardware point of view, such a conversion requires some assistance to make it fit in a reasonable number of cycles; otherwise it gets too expensive. A naive but rather good explanation follows. Throughout, I assume that a single CPU clock cycle is enough for an operation like a full-width integer add (but not radically longer!), and that all results of the previous cycle are applied at the cycle boundary.
The first clock cycle, with appropriate hardware assistance (a priority encoder), gives the Count Leading Zeros (CLZ) result along with detection of two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are better processed separately (load a constant into the destination register and finish). Otherwise, if the input integer was negative, it shall be negated; this usually requires one more cycle (because negation is a combination of inversion and adding a carry bit). So, 1-2 cycles are spent.
At the same time, it can calculate the biased exponent prediction, based on the CLZ result. Notice we needn't take care of denormalized values or infinity. (Can we predict CLZ(-x) based on CLZ(x) if x < 0? If we can, this saves us 1 cycle.)
Then the shift is applied (1 cycle again, with a barrel shifter) to place the integer value so that its highest 1 is at a fixed position (e.g. with the standard 3 extension bits and a 24-bit mantissa, this is bit number 26). This use of the barrel shifter shall combine all lower bits into the sticky bit (a separate custom barrel-shifter instance may be needed; but this is way cheaper than megabytes of cache or an OoO dispatcher). Now, up to 3 cycles.
Then rounding is applied. Rounding analyzes, in our case, the 4 lowest bits of the current value (mantissa LSB, guard, round and sticky), together with the current rounding mode and the target sign (extracted at cycle 1). Rounding to zero (RZ) results in ignoring the guard/round/sticky bits. Rounding to -∞ (RMI) for a positive value and to +∞ (RPI) for a negative one is the same as rounding to zero. Rounding toward the ∞ on the value's own side (RPI for positive, RMI for negative) adds 1 to the main mantissa when any of the guard/round/sticky bits is set. Finally, rounding-to-nearest-ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If the hardware is fast enough to apply this result in the same cycle (I guess it's likely), we have up to 4 cycles now.
The adding in the previous step can produce a carry (like 1111 -> 10000), so the exponent can increase. The final cycle packs the sign (from cycle 1), the mantissa (into the "significand") and the biased exponent (calculated at cycle 2 from the CLZ result and possibly adjusted by the carry from cycle 4). So, 5 cycles in total.
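To make the steps concrete, here is a rough software model of the same path for uint32 -> binary32 (unsigned input for brevity, so the sign/negate step from cycle 1 is omitted; round-to-nearest-even only; GCC/Clang's __builtin_clz plays the "priority encoder" role). It is a sketch, not how the hardware is wired:

    #include <stdint.h>
    #include <string.h>

    float u32_to_f32(uint32_t x) {
        if (x == 0) return 0.0f;                      // special case handled up front
        int clz = __builtin_clz(x);                   // "priority encoder": count leading zeros
        int exp = 158 - clz;                          // biased exponent: 127 + (31 - clz)
        uint32_t norm = x << clz;                     // barrel shift: leading 1 now at bit 31
        uint32_t mant = norm >> 8;                    // 24-bit significand including the hidden bit
        uint32_t rest = norm & 0xFF;                  // discarded bits -> guard/round/sticky material
        // Round to nearest, ties to even.
        if (rest > 0x80 || (rest == 0x80 && (mant & 1))) {
            mant++;
            if (mant == (1u << 24)) { mant >>= 1; exp++; }   // carry out of the significand
        }
        uint32_t bits = ((uint32_t)exp << 23) | (mant & 0x7FFFFF);
        float f;
        memcpy(&f, &bits, sizeof f);                  // pack sign (0), exponent and fraction
        return f;
    }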
Is conversion from int to float generally cheaper than float to int?
We can estimate the reverse conversion the same way, e.g. from binary32 to int32 (signed). Let's assume that conversion of NaN, INF or a too-big value results in a fixed value, say INT_MIN (-2147483648). In that case:
Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also apply the rounding mode. A "conversion impossible" (overflow or invalid) signal is generated if BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or if BE < 126 (abs(x) < 0.5) with RNE; or if BE = 126, the significand is 0 (without the hidden bit) and RNE. Otherwise, signals for a final +1 or -1 can be generated for the cases: BE < 127 and {(x < 0 and RMI) or (x > 0 and RPI)}; or BE = 126 and RNE. All these signals can be calculated during one cycle using boolean logic circuitry, and if any fires, the result is finalized in the first cycle. In parallel and independently, calculate 157-BE using a separate adder, for use at cycle 2.
If not finalized yet, we have abs(x) >= 1, so BE >= 127, but BE <= 157 (so abs(x) < 2^31). Get 157-BE from cycle 1; this is the needed shift amount. Apply a right shift by this amount, using the same barrel shifter as in the int -> float algorithm, to a value with (again) 3 additional bits and sticky-bit gathering. Here, 2 cycles are spent.
Apply rounding (see above). 3 cycles spent, and a carry can be produced. Here we can again detect integer overflow and produce the respective result value. Forget the additional bits; only 31 bits are of value now.
Finally, negate the resulting value if x was negative (sign=1). Up to 4 cycles spent.
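Again as a concrete (software) model of these steps, restricted to truncation toward zero and with out-of-range/NaN mapped to INT_MIN as assumed above - a sketch with made-up names:

    #include <stdint.h>
    #include <string.h>

    int32_t f32_to_i32_trunc(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);               // split the input value
        uint32_t sign = bits >> 31;
        int be = (int)((bits >> 23) & 0xFF);          // biased exponent
        uint32_t m = (bits & 0x7FFFFF) | 0x800000;    // significand with the hidden bit
        if (be >= 158) return INT32_MIN;              // NaN, INF or |x| >= 2^31: "conversion impossible"
        if (be < 127) return 0;                       // |x| < 1 truncates to 0
        // The value is m * 2^(be - 150); be <= 157, so the left shift is at most 7.
        int shift = be - 150;
        uint32_t mag = (shift >= 0) ? (m << shift) : (m >> -shift);
        return sign ? -(int32_t)mag : (int32_t)mag;
    }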
I'm not an experienced binary-logic developer, so I could be missing some chance to compact this sequence, but it looks rather close to the Intel values. So the conversions themselves are quite cheap, provided hardware assistance is present (again, it amounts to no more than a few thousand gates, so it is tiny for contemporary chip production).
You can also take a look at the Berkeley SoftFloat library - it implements virtually the same approach with minor modifications. Start with the ui32_to_f32.c source file. They use more additional bits for intermediate values, but this isn't essential.
See Netch's excellent answer regarding the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so the int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to integer, there has to be an FWAIT (Intel).

What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).
I gather there are linear time algorithms for taking various notions of multiplicative inverse of my dividend. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my divisor, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R" which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and unclear how much overhead memory, etc, will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is
Estimate the quotient q of x/y
Compute the corresponding reduction r = x - qy
Optionally adjust the quotient so that the reduction r is in some desired interval
If r is too big, then repeat with r in place of x.
The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers the cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
The divide-and-conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results if x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
Estimate the inverse of y with m = floor(2^k / y)
Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.
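To make the multiply-by-inverse idea concrete at machine-word scale, here is a sketch of a Barrett-style estimate for dividing a 64-bit x by a 32-bit y using only multiplies, shifts and a small fix-up loop (the names are illustrative, and it assumes GCC/Clang's unsigned __int128):

    #include <stdint.h>

    typedef struct { uint64_t m; uint32_t y; } recip32_t;

    // Precompute m ~= floor(2^64 / y); m never overestimates, so the quotient estimate
    // below is never too large and only needs a small upward adjustment.
    recip32_t recip32_init(uint32_t y) {
        recip32_t r = { (~(uint64_t)0) / y, y };
        return r;
    }

    uint64_t div_by_recip(uint64_t x, recip32_t r, uint64_t *rem) {
        uint64_t q = (uint64_t)(((unsigned __int128)x * r.m) >> 64);  // q ~= floor(x / y), slightly low
        uint64_t rr = x - q * r.y;                                    // corresponding reduction
        while (rr >= r.y) { rr -= r.y; q++; }                         // adjust so that 0 <= r < y
        *rem = rr;
        return q;
    }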
I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. Especially take a look at the approximation divider and the 2 links there. One is my fixed-point divider and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen on NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.
I'm not sure that the multiplicative inverse is the way to go.
It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire a bigint inverse can be bigger than the standard division itself, but as I am not familiar with it I could be wrong.
The most common divider I have seen in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually use multiplication, which defines their speed.
For small enough numbers, long binary division and 32/64-bit digit-base division are usually fast enough, if not fastest: they have small overhead. In what follows, let n be the max value processed (not the number of digits!).
Binary division example:
It is O(log32(n)·log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, sub, add and bitshift. Each of those operations can be done in log32(n), and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
template <DWORD N> void uint<N>::div(uint &c, uint &d, uint a, uint b)
{
    int i, j, sh;
    sh = 0; c = DWORD(0); d = 1;
    sh = a.bits() - b.bits();
    if (sh < 0) sh = 0; else { b <<= sh; d <<= sh; }
    for (;;)
    {
        j = geq(a, b);
        if (j)
        {
            c += d;
            sub(a, a, b);
            if (j == 2) break;
        }
        if (!sh) break;
        b >>= 1; d >>= 1; sh--;
    }
    d = a;
}
N is the number of 32-bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n) = N).
It returns 0 for a < b, 1 for a > b, 2 for a == b.
sub(c,a,b) is c = a - b.
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shifts).
If you use digits with a big base like 2^32 (ALU words), then you can rewrite the whole thing in a polynomial-like style using the 32-bit built-in ALU operations.
This is usually even faster than binary long division; the idea is to process each DWORD as a single digit, or to recursively halve the arithmetic width used until you hit the CPU's capabilities.
See division by half-bitwidth arithmetics
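As a small example of the "one DWORD = one digit" style (a sketch with illustrative names, not taken from the template above): dividing a bigint by a single 32-bit word, using the 64-bit ALU for each 2-digit-by-1-digit step:

    #include <cstdint>
    #include <vector>

    // a is stored least-significant word first; computes a /= b and returns the remainder.
    uint32_t div_by_word(std::vector<uint32_t> &a, uint32_t b)
    {
        uint64_t rem = 0;
        for (size_t i = a.size(); i-- > 0; )        // most significant digit first
        {
            uint64_t cur = (rem << 32) | a[i];      // two-digit dividend rem:a[i]
            a[i] = (uint32_t)(cur / b);             // one quotient digit
            rem  = cur % b;                         // remainder carries into the next digit
        }
        return (uint32_t)rem;
    }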
On top of all that, while computing with bignums:
If you have optimized basic operations, then the complexity can drop even further, as sub-results get smaller with each iteration (changing the complexity of the basic operations). A nice example of that is NTT-based multiplication.
The overhead can mess things up.
Because of this, the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit counts actually used, to get the maximum performance, and optimize what you can.