fastest way to convert two-bit number to low-memory representation - c++

I have a 56-bit number with potentially two set bits, e.g., 00000000 00000000 00000000 00000000 00000000 00000000 00000011. In other words, two bits are distributed among 56 bits, so that we have bin(56,2)=1540 possible permutations.
I now look for a loss-free mapping of such an 56 bit number to an 11-bit number that can carry 2048 and therefore also 1540. Knowing the structure, this 11-bit number is enough to store the value of my low-density (of ones) 56 bit number.
I want to maximize performance (this function should run millions or even billions of times per second if possible). So far, I only came up with some loop:
int inputNumber = 24; // 11000
int bitMask = 1;
int bit1 = 0, bit2 = 0;
for(int n = 0; n < 54; ++n, bitMask *= 2)
{
if((inputNumber & bitMask) != 0)
{
if(bit1 != 0)
bit1 = n;
else
{
bit2 = n;
break;
}
}
}
and using these two bits, I can easily generate some 1540 max number.
But is there no faster version than using such a loop?

Most ISAs have hardware support for a bit-scan instruction that finds the position of a set bit. Use that instead of a naive loop or bithack for any architecture where you care about this running fast. https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious has some tricks that are better than nothing, but those are all still much worse than a single efficient asm instruction.
But ISO C++ doesn't portably expose clz/ctz operations; it's only available via intrinsics / builtins for various implementations. (And the x86 intrinsincs have quirks for all-zero input, corresponding to the asm instruction behaviour).
For some ISAs, it's a count-leading-zeros giving you 31 - highbit_index. For others, it's a CTZ count trailing zeros operation, giving you the index of the low bit. x86 has both. (And its high-bit finder actually directly finds the high-bit index, not a leading-zero count, unless you use BMI1 lzcnt instead of traditional bsr) https://en.wikipedia.org/wiki/Find_first_set has a table of what different ISAs have.
GCC portably provides __builtin_clz and __builtin_ctz; on ISAs without hardware support, they compile to a call to a helper functions. See What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? and Implementation of __builtin_clz
(For 64-bit integers, you want the long long versions: like __builtin_ctzll GCC manual.)
If we only have a CLZ, use high=63-CLZ(n) and low= 63-CLZ((-n) & n) to isolate the low bit. Note that x86's bsr instruction actually produces 63-CLZ(), i.e. the bit-index instead of the leading-zero count. So 63-__builtin_clzll(n) can compile to a single instruction on x86; IIRC gcc does notice this. Or 2 instructions if GCC uses an extra xor-zeroing to avoid the inconvenient false dependency.
If we only have CTZ, do low = CTZ(n) and high = CTZ(n & (n - 1)) to clear the lowest set bit. (Leaving the high bit, assuming the number has exactly 2 set bits).
If we have both, low = CTZ(n) and high = 63-CLZ(n). I'm not sure what GCC does on non-x86 ISAs where they aren't both available natively. The GCC builtins are always available even when targeting HW that doesn't have it. But the internal implementation can't use the above tricks because it doesn't know there are always exactly 2 bits set.
(I wrote out the full formulas; an earlier version of this answer had CLZ and CTZ reversed in this part. I find that happens to me easily, especially when I also have to keep track of x86's bsr and bsr (bitscan reverse and forward) and remember that those are leading and trailing, respectively.)
So if you just use both CTZ and CLZ, you might end up with slow emulation for one of them. Or fast emulation on ARM with rbit to bit-reverse for clz, which is 100% fine.
AVX512CD has SIMD VPLZCNTQ for 64-bit integers, so you could encode 2, 4, or 8x 64-bit integers in parallel with that on recent Intel CPUs. For SSSE3 or AVX2, you can build a SIMD lzcnt by using pshufb _mm_shuffle_epi8 byte-shuffle as a 4-bit LUT and combining with _mm_max_epu8. There was a recent Q&A about this but I can't find it. (It might have been for 16-bit integers only; wider requires more work.)
With this, a Skylake-X or Cascade Lake CPU could maybe compress 8x 64-bit integers per 2 or 3 clock cycles once you factor in the throughput cost of packing the results. SIMD is certainly useful for packing 12-bit or 11-bit results into a contiguous bitstream, e.g. with variable-shift instructions, if that's what you want to do with the results. At ~3 or 4GHz clock speed, that could maybe get you over 10 billion per clock with a single thread. But only if the inputs come from contiguous memory. Depending what you want to do with the results, it might cost a few more cycles to do more than just pack them down to 16-bit integers. e.g. to pack into a bitstream. But SIMD should be good for that with variable-shift instructions that can line up the 11 or 12 bits from each register into the right position to OR together after shuffling.
There's a tradeoff between coding efficiency and encode performance. Using 12 bits for two 6-bit indices (of bit positions) is very simple both to compress and decompress, at least on hardware that has bit-scan instructions.
Or instead of bit-indices, one or both could be leading zero counts, so decoding would be (1ULL << 63) >> a. 1ULL>>63 is a fixed constant that you can actually right-shift, or the compiler could turn it into a left-shift of 1ULL << (63-a) which IIRC optimizes to 1 << (-a) in assembly for ISAs like x86 where shift instructions mask the shift count (look only at the low 6 bits).
Also, 2x 12 bits is a whole number of bytes, but 11 bits only gives you a whole number of bytes every 8 outputs, if you're packing them. So indexing a bit-packed array is simpler.
0 is still a special case: maybe handle that by using all-ones bit-indices (i.e. index = bit 63, which is outside the low 56 bits). On decode/decompress, you set the 2 bit positions (1ULL<<a) | (1ULL<<b) and then & mask to clear high bits. Or bias your bit indices and have decode right shift by 1.
If we didn't have to handle zero then a modern x86 CPU could do 1 or 2 billion encodes per second if it didn't have to do anything else. e.g. Skylake has 1 per clock throughput for bit-scan instructions and should be able to encode at 1 number per 2 clocks just bottlenecked on that. (Or maybe better with SIMD). With just 4 scalar instructions, we can get the low and high indices (64-bit tzcnt + bsr), shift by 6 bits, and OR together.1 Or on AMD, avoid bsr / bsf and manually do 63-lzcnt.
A branchy or branchless check for input == 0 to to set the final result to whatever hard-coded constant (like 63 , 63) should be cheap, though.
Compression on other ISAs like AArch64 is also cheap. It has clz but not ctz. Probably your best bet there is use an intrinsic for rbit to bit-reverse a number (so clz on the bit-reversed number directly gives you the bit-index of the low bit. Which is now the high bit of the reversed version.) Assuming rbit is as fast as add / sub, this is cheaper than using multiple instructions to clear the low bit.
If you really want 11 bits then you need to avoid the redundancy of 2x 6-bit being able to have either index larger than the other. Like maybe have 6-bit a and 5-bit b, and have a<=b mean something special like b+=32. I haven't thought this through fully. You need to be able to encode 2 adjacent bits either near the top or bottom of the registers, or the 2 set bits could be as far apart as 28 bits, if we consider wrapping at the boundaries like a 56-bit rotate.
Melpomene's suggestion to isolate the low and high set bits might be useful as part of something else, but is only useful for encoding on targets where you only have one direction of bit-scan available, not both. Even so, you wouldn't actually use both expressions. Leading-zero count doesn't require you to isolate the low bit, you just need to clear it to get at the high bit.
Footnote 1: decoding on x86 is also cheap: x |= (1<<a) is 1 instruction: bts. But many compilers have missed optimizations and don't notice this, instead actually shifting a 1. bts reg, reg is 1 uop / 1 cycle latency on Intel since PPro, or sometimes 2 uops on AMD. (Only the memory destination version is slow.) https://agner.org/optimize/
Best encoding performance on AMD CPUs requires BMI1 tzcnt / lzcnt because bsr and bsf are slower (6 uops instead of 1 https://agner.org/optimize/). On Ryzen, lzcnt is 1 uop, 1c latency, 4 per clock throughput. But tzcnt is 2 uops.
With BMI1, the compiler could use blsr to clear the lowest set bit of a register (and copy it). i.e. modern x86 has an instruction for dst = (SRC-1) bitwiseAND ( SRC ); that are single-uop on Intel but 2 uops on AMD.
But with lzcnt being more efficient than tzcnt on AMD Ryzen, probably the best asm for AMD doesn't use it.
Or maybe something like this (assuming exactly 2 bits, which apparently we can do).
(This asm is what you'd like to get your compiler to emit. Don't actually use inline asm!)
Ryzen_encode_scalar: ; input in RDI, output in EAX
lzcnt rcx, rdi ; 63-high bit index
tzcnt rdx, rdi ; low bit
mov eax, 63
sub eax, ecx
shl edx, 6
or eax, edx ; (low_bit << 6) | high_bit
ret ; goes away with inlining.
Shifting the low bit-index balances the lengths of the critical path, allowing better instruction-level parallelism, if we need 63-CLZ for the high bit.
Throughput: 7 uops total, and no execution-unit bottlenecks. So at 5 uops per clock pipeline width, that's better than 1 per 2 clocks.
Skylake_encode_scalar: ; input in RDI, output in EAX
tzcnt rax, rdi ; low bit. No false dependency on Skylake. GCC will probably xor-zero RAX because there is on Broadwell and earlier.
bsr rdi, rdi ; high bit index. same,same reg avoids false dep
shl eax, 6
or eax, edx
ret ; goes away with inlining.
This has 5 cycle latency from input to output: bitscan instructions are 3 cycles on Intel vs. 1 on AMD. SHL + OR each add 1 cycle.
For throughput, we only bottleneck on one bit-scan per cycle (execution port 1), so we can do one encode per 2 cycles with 4 uops of front-end bandwidth left over for load, store, and loop overhead (or something else), assuming we have multiple independent encodes to do.
(But for the multiple independent encode case, SIMD may still be better for both AMD and Intel, if a cheap emulation of vplzcntq exists and the data is coming from memory.)
Scalar decode can be something like this:
decode: ;; input in EDI, output in RAX
xor eax, eax ; RAX=0
bts rax, rdi ; RAX |= 1ULL << (high_bit_idx & 63)
shr edi, 6 ; extract low_bit_idx
bts rax, rdi ; RAX |= 1ULL << low_bit_idx
ret
This has 3 shifts (including the bts) which on Skylake can only run on port0 or port6. So on Intel it only costs 4 uops for the front-end (so 1 per clock as part of doing something else). But if doing only this, it bottlenecks on shift throughput at 1 decode per 1.5 clock cycles.
On a 4GHz CPU, that's 2.666 billion decodes per second, so yeah we're doing pretty well hitting your targets :)
Or Ryzen, bts reg,reg is 2 uops , with 0.5c throughput, but shr can run on any port. So it doesn't steal throughput from bts, and the whole thing is 6 uops (vs. Ryzen's pipeline being 5-wide at the narrowest point). So 1 encode per 1.2 clock cycles, just bottlenecked on front-end cost.
With BMI2 available, starting with a 1 in a register and using shlx rax, rbx, rdi can replace the xor-zeroing + first BTS with a single uop, assuming the 1 in a register can be reused in a loop.
(This optimization is totally dependent on your compiler to find; flag-less shifts are just more efficient ways to copy-and-shift that become available with -march=haswell or -march=znver1, or other targets that have BMI2.)
Either way you're just going to write retval = 1ULL << (packed & 63) for decoding the first bit. But if you're wondering which compilers make nice code here, this is what you're looking for.

Related

How to unset N right-most set bits

There is a relatively well-known trick for unsetting a single right-most bit:
y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)
I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?
Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).
// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000
I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.
Maybe there is some hardware support for it?
For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen3 has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.
I'm not aware of other ISAs having a similar bit-packing hardware operation.
pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.
But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.
uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
return _pdep_u64(-1ULL << n, a);
}
-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.
On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.
GCC and clang both compile it the obvious way, as written:
# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
mov rax, -1
shlx rax, rax, rsi
pdep rax, rax, rdi
ret
(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)
So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).
And is only 3 uops for the front-end.
A semi-related BMI2 trick:
pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.
Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to #harold for spotting this in the first version of this answer.
Without fast PDEP: perhaps binary search for the right popcount
#Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.
Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).
Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.
High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).
Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.

Split a number into several numbers, each with only one significant bit

Is there any efficient algorithm (or processor instruction) that will help divide the number (32bit and 64bit) into several numbers, in which there will be only one 1-bit.
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
Only comes to mind number & mask.
Assembly or С++.
Yes, in a similar way as Brian Kernighan's algorithm to count set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediary result:
while (number) {
// extract lowest set bit in number
uint64_t m = number & -number;
/// use m
...
// remove lowest set bit from number
number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may be done with number ^= m but that may make it harder for the compiler to see that it can use blsr, which is a better choice because it depends only directly on number so it shortens the loop carried dependency chain.
The standard way is
while (num) {
unsigned mask = num ^ (num & (num-1)); // This will have just one bit set
...
num ^= mask;
}
for example starting with num = 2019 you will get in order
1
2
32
64
128
256
512
1024
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see #harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd : given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on AVX2 what is the most efficient way to pack left based on a mask?
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
const __m512i bitmasks_lo = _mm512_set_epi32(
1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
);
const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.
__mmask16 set_lo = x;
__mmask16 set_hi = x>>16;
int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov, __mask16 is really just uint16_t
_mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
_mm512_mask_compressstoreu_epi32(out_array+count_lo, set_hi, bitmasks_hi);
return _mm_popcnt_u32(x);
}
Compiles nicely with clang on Godbolt, and with gcc other than a couple minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also can compile with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on IceLake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight. But not its scalar ALU, the only one that handles 3-cycle latency integer instructions. movzx dword, word can't be eliminated, only movzx dword, byte can do that, but it runs on any port.)
Latency: integer result is just one popcnt (3 cycles). First part of the memory result is stored about 7 cycles after the mask is ready. (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant would be possible but probably not worth it, by building it with a shift. e.g. loading 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then using that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd+shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler) since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle control vector for vpermt2d. Possibly from a sliding-window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
Things are much worse for 64-bit integers 8x 64-bit elements means we need 8 vectors. So maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpslld by 8 to move bits in vector elements. You'd think kshiftrq would be good, but with 4 cycle latency that's a long loop-carried dep chain. And you need scalar popcnt of each 8-bit chunk anyway to adjust the pointer. So your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops).
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover mask 64 bits), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update.
vector: vpslld immediate, starting from the vector constant
mask: shr r64, 8 starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
pointer: lea rdi, [rdi + rax*4] where RAX holds a popcnt result.
The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with vpcompressd shuffles and kmov

How to create a 8 bit mask from lsb of __m64 value?

I have a use case, where I have array of bits each bit is represented as 8 bit integer for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only lsb of each value. I know that using int _mm_movemask_pi8 (__m64 a) function I can create a mask but this intrinsic only takes a msb of a byte not lsb. Is there a similar intrinsic or efficient method to extract lsb to create single 8 bit integer?
There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:
_mm_movemask_pi8(_mm_slli_si64(x, 7))
Using MMX these days is strange and should probably be avoided.
Here is an SSE2 version, still reading only 8 bytes:
int lsb_mask8(uint8_t* bits) {
__m128i x = _mm_loadl_epi64((__m128i*)bits);
return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}
Using SSE2 instead of MMX avoids the needs for EMMS
If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of #wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).
unsigned extract8LSB(uint8_t *arr) {
uint64_t bytes;
memcpy(&bytes, arr, 8);
unsigned LSBs = _pext_u64(bytes ,0x0101010101010101);
return LSBs;
}
This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.
pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, like 18c latency and throughput. (https://agner.org/optimize/). If you care about AMD, you should definitely use #harold's pmovmskb answer.
Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector, and get a 32-bit bitmap. You can split that up if needed, or unroll the loop using by 4, to right-shift the bitmap to get all 4 single-byte results.
If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache then a loop that just does this would not be keeping its execution units busy while waiting for memory.

What's the meaning of "each CPU instruction can manipulate 32 bits of data"?

From: https://www.webopedia.com/TERM/R/register.html
The number of registers that a CPU has and the size of each (number of bits) help determine the power and speed of a CPU. For example a 32-bit CPU is one in which each register is 32 bits wide. Therefore, each CPU instruction can manipulate 32 bits of data.
What's the meaning of "each CPU instruction can manipulate 32 bits of data" w.r.t the C/C++ programs which we write, the text which we write in notepads?
First; "each CPU instruction can manipulate 32 bits of data" is a (technically incorrect) generalisation. For example (32-bit 80x86) there are instructions (e.g. cmpxchg8b, pushad, shrd) and entire extensions (MMX, SSE, AVX) where an instruction can manipulate more than 32 bits of data.
For performance; it's best to think of it as either "amount of work that can be done in a fixed amount of time" or "amount of time to do a fixed amount of work". This can be broken into 2 values - how many instructions you need to do an amount of work and how many instructions can be executed in a fixed amount of time (instructions per second).
Now consider something like adding a pair of 128-bit integers. For a 32-bit CPU this has to be broken down into four 32-bit additions, and might look something like this:
;Do a = a + b
mov eax,[b]
mov ebx,[b+4]
mov ecx,[b+8]
mov edx,[b+12]
add [a],eax
adc [a+4],ebx
adc [a+8],ecx
adc [a+12],edx
In this case "how many instructions you need to do an amount of work" is 8 instructions.
With a 16-bit CPU you need more instructions. For example, it might be more like this:
mov ax,[b]
mov bx,[b+2]
mov cx,[b+4]
mov dx,[b+6]
add [a],ax
mov ax,[b+8]
adc [a+2],bx
mov bx,[b+10]
adc [a+4],cx
mov cx,[b+12]
adc [a+6],dx
mov dx,[b+14]
add [a+8],ax
adc [a+10],bx
adc [a+12],cx
adc [a+14],dx
In this case "how many instructions you need to do an amount of work" is 16 instructions. With the same "instructions per second" a 16-bit CPU would be half as fast as a 32-bit CPU for this work.
With a 64-bit CPU this work would only need 4 instruction, maybe like this:
mov eax,[b]
mov ebx,[b+8]
add [a],eax
adc [a+8],ebx
In this case, with the same "instructions per second", a 64-bit CPU would be twice as fast as a 32-bit CPU (and 4 times as fast as a 16-bit CPU).
Of course the high level source code would be the same in all cases - the difference is what the compiler generates.
Note that what I've shown here (128-bit integer addition) is a "happy case" - I chose this specifically because it's easy to show how larger registers can reduce/improve "how many instructions you need to do an amount of work" and therefore improve performance (at the same "instructions per second"). For different work you might not get the same improvement. For example, for a function that works with 8-bit integers (e.g. char) "larger than 8-bit registers" might not help at all (and in some cases might make things worse).
Computers, operating systems, or software programs capable of transferring data 32-bits at a time. With computer processors, (e.g. 80386, 80486, and Pentium) they were 32-bit processors, which means the processor were capable of working with 32 bit binary numbers (decimal number up to 4,294,967,295). Anything larger and the computer would need to break up the number into smaller pieces
A "word" is the size of the basic unit of change.
This CPU, in addition to being a 32 bit CPU, has a 32 bit sized word. If it was changing an item, unless extra CPU cycles are used, the largest "single" item it can change is one 32 bit value.
This doesn't mean that any 32 bit can be changed with one instruction. But if the 32 bits are all part of the same word, they may be able to be changed in one instruction.

Mysteries of C++ optimization

Take the two following snippets:
int main()
{
unsigned long int start = utime();
__int128_t n = 128;
for(__int128_t i=1; i<1000000000; i++)
n = (n * i);
unsigned long int end = utime();
cout<<(unsigned long int) n<<endl;
cout<<end - start<<endl;
}
and
int main()
{
unsigned long int start = utime();
__int128_t n = 128;
for(__int128_t i=1; i<1000000000; i++)
n = (n * i) >> 2;
unsigned long int end = utime();
cout<<(unsigned long int) n<<endl;
cout<<end - start<<endl;
}
I am benchmarking 128 bit integers in C++. When executing the first one (just multiplication) everything runs in approx. 0.95 seconds. When I also add the bit shift operation (second snippet) the execution time raises to an astounding 2.49 seconds.
How is this possible? I thought that bit shifting was one of the lightest operations for a processor. How comes that there is so much overhead due to such a simple operation? I am compiling with O3 flag activated.
Any idea?
This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes an operation in fewer clock cycles if one of the operands is zero.
While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles in modern processors, regardless of input. The note in Intel's documentation that initially made me think that the number of cycles for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I did some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands and the results didn't yield significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.
While contemplating the possibility of deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few more things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and enhance the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.
The information is structured into the following sections:
What the code does
What the compiler does
What the processor does
What you can do about it
Unanswered questions
What the code does
It overflows, mostly.
In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.
In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).
In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. Quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to
128 / 2 + 128 / 4 + ... + 2 + 1 = 26 + 25 + ... + 2 + 1 = 27 - 1
while the number of factors of two taken out by the right shift (if it had enough to take from) is 128 * 2, more than double.
Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined behaviour land, so all bets are off. Problem solved; stop reading now.
What the compiler does
If you're still reading, from this point forward we'll ignore the overflows and look at some implementation details. "The compiler", in this case, means GCC 4.9.2 or Clang 3.5.1. I've only done performance measurements on code generated by GCC. For Clang, I've looked at the generated code for a few test cases and noted some differences that I'll mention below, but I haven't actually run the code; I might have missed some things.
Both multiplication and shift operations are available for 64-bit operands, so 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 264 nh + nl, where nh and nl are the high and low 64-bit halves, respectively. The same goes for i. So, the multiplication can be written:
(264 nh + nl)(264 ih + il) = 2128 nh ih + 264 (nh il + nl ih) + nl il
The first term doesn't have any non-zero bits in the lower 128-bit part; it's either all overflow or all zero. Since ignoring integer overflows is valid and common for C++ implementations, that's what the compiler does: the first term is ignored completely.
The parenthesis only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from the two multiplications or the addition is also ignored (the result is truncated to 64 bits).
The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesis discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.
That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
Now, the shift: using the same notations as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter is shifted right by two bits. Using C++ syntax, it would look like this:
nl = nh << 62 | nl >> 2 //Doesn't change nh, only uses its bits.
Besides that, nh also needs to be shifted, using something like
nh >>= 2;
That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.
What the processor does
... is highly processor-dependent. (No... really?!)
Detailed information about what happens for Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed data, here are some sources:
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Software Optimization Guide for AMD Family 15h Processors
Agner Fog's microarchitecture manual and instruction tables.
I'll refer to the following architectures:
Intel Sandy Bridge / Ivy Bridge - abbreviated "IntelSB" going forward;
Intel Haswell / Broadwell - "IntelH" going forward;
I'll just use "Intel" for things that are the same between SB and H.
AMD Bulldozer / Piledriver / Steamroller - "AMD" going forward.
I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it seems to have a tendency to shuffle registers around unnecessarily, when compiling simpler code than for other cases when it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.
First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:
nh * il
\
+ -> tmp
/ \
nl * ih + -> next nh
/
high 64 bits
/
nl * il --------
\
low 64 bits
\
-> next nl
(sorry, I hope it gets the point across)
Some important points:
The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.
Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit for each core, but they can execute concurrently through pipelining. One multiplication can begin executing on each cycle on Intel, and every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.
All of the above mean that, if the loop's body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration are made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).
The problem is that in the second code snippet something does get in the way:
nh * il
\ normal shift -> next nh
+ -> tmp /
/ \ /
nl * ih + ----> temp nh
/ \
high 64 bits \
/ "composite" shift -> next nl
nl * il -------- /
\ /
low 64 bits /
\ /
-> temp nl ---------
The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.
If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.
If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish, and then they'll be ready at the same time and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.
I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and normal shift on this platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.
Some more data on the troublesome composite shift, SHRD:
On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle;
On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.
What you can do about it
I can think of three possible improvements:
replace SHRD with something faster on platforms where it makes sense;
optimize the multiplication to take advantage of the data types involved here;
restructure the loop.
1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit shift right by two bits could look like this:
__int128_t shr2(__int128_t n)
{
using std::int64_t;
using std::uint64_t;
//Unpack the two halves.
int64_t nh = n >> 64;
uint64_t nl = static_cast<uint64_t>(n);
//Do the actual work.
uint64_t rl = nl >> 2 | nh << 62;
int64_t rh = nh >> 2;
//Pack the result.
return static_cast<__int128_t>(rh) << 64 | rl;
}
Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, those are effectively no-ops in the generated assembly code.
However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).
Functions of the same form can be used for shifts by other numbers of bits, less than 64. From 64 to 127, it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits for shifting as a runtime parameter to a shr function. Shift instructions by a variable number of bits are slower than the ones using a constant number on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave once cycle off some calculations; there could theoretically be cases where it could make things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop because it will change things from [nh being ready after once cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.
GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the patch for LLVM that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.
2. Optimizing the multiplication: The test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). What this means is that ih will always be zero; this reduces the diagram for 128-bit multiplication to the following:
nh * il
\
\
\
+ -> next nh
/
high 64 bits
/
nl * il
\
low 64 bits
\
-> next nl
Normally, I'd just say "declare i as long long and let the compiler optimize things" but unfortunately this doesn't work here; both compilers go for the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up on 128 bits even if it starts on 64. We'll have to do things the hard way:
__int128_t mul(__int128_t n, long long i)
{
using std::int64_t;
using std::uint64_t;
//Unpack the two halves.
int64_t nh = n >> 64;
uint64_t nl = static_cast<uint64_t>(n);
//Do the actual work.
__asm__(R"(
movq %0, %%r10
imulq %2, %%r10
mulq %2
addq %%r10, %0
)" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");
//Pack the result.
return static_cast<__int128_t>(nh) << 64 | nl;
}
I said I tried to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case, but, since I was testing on GCC, I had to use a reliable way to get the right code in there.
Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.
3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop for the second sample with something like this:
__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
n *= i;
n2 *= j;
n >>= 2;
n2 >>= 2;
}
n *= (n2 * j) >> 2;
This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).
The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration, to avoid overflow, the result will still be different by about 10%, because of rounding happening in different ways when shifting right. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?
Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means the one from the original code (I divided the total number of cycles for executing the whole loop by the total number of iterations executed by the original code, not for the restructured one, to have a meaningful comparison):
The code above executes in 7 cycles per iteration, a gain of one cycle over the original.
The code above with the multiplication operator replaced with our mul() function needs 6 cycles per iteration.
The restructured code does suffer from more register shuffling, which can't be avoided unfortunately (more variables). More recent processors like IntelH have architecture improvements that make register moves essentially free in many cases; this could make the code yield even larger gains. Using new instructions like MULX for IntelH could avoid some register moves altogether; GCC does use such instructions when compiling with -march=haswell.
Unanswered questions
None of the measurements that we have so far explain the large differences in performance reported by the OP, and observed by me on a different system.
My initial timings were taken on a remote system (Westmere family processor) where, of course, a lot of things could happen; yet, the results were strangely stable.
On that system, I also experimented with executing the second sample with a right shift and a left shift; the code using a right shift was consistently 50% slower than the other variant. I couldn't replicate that on my controlled test system on IntelSB, and I don't have an explanation for those results either.
We can discard all of the above as unpredictable side effects of compiler / processor / overall system behaviour, but I can't shake the feeling that not everything has been explained here.
It's true that it doesn't really make much sense to benchmark such tight loops without a controlled system, precise tools (counting cycles) and looking at the generated assembly code for each case. Compiler idiosyncrasies can easily result in code that artificially introduces differences of 50% or more in performance.
Another factor that could explain large differences is the presence of Intel Hyper-Threading. Different parts of the core behave differently when this is enabled, and the behaviour has also changed between processor families. This could have strange effects on tight loops.
To top everything off, here's a crazy hypothesis: Flipping bits consumes more power than keeping them constant. In our case, the first sample, working with zero values most of the time, will be flipping far fewer bits than the second one, so the latter will consume more power. Many modern processors have features that dynamically adjust the core frequency depending on electrical and thermal limits (Intel Turbo Boost / AMD Turbo Core). This means that, theoretically, under the right (or wrong?) conditions, the second sample could trigger a reduction of the core frequency, thus making the same number of cycles take longer time, and making the performance data-dependent.
After benchmarking both (using the assembly generated by GCC 4.7.3 on -O2) on my 4770K, I found that the first one takes 5 cycles per iteration and the second one takes 9 cycles per iteration. Why so much difference?
It turns out to be an interplay between throughput and latency. The main killer is shrd, which takes 3 cycles and is on the critical path. Here's a picture of it (I ignore the chain for i because it is faster and there is plenty of spare throughput for it to just run ahead, it will not interfere):
The edges here are dependencies, not dataflow.
Based solely on latencies in this chain, the expected time would be 8 cycles per iteration. But it is not. The problem here is that for 8 cycles to happen, mul2 and imul3 have to be executed in parallel, and integer multiplication only has a throughput of 1/cycle. So it (either one) has to wait a cycle, and holds up the chain by a cycle. I verified this by changing that imul to an add, which reduced the time to 8 cycles per iteration. Changing the other imul to an add had no effect, as predicted based on this explanation (it doesn't depend on shrd and can thus be scheduled earlier, without interfering with the other multiplications).
These exact details are only for Haswell.
The code I used was this:
section .text
global cmp1
proc_frame cmp1
[endprolog]
mov r8, rsi
mov r9, rdi
mov esi, 1
xor edi, edi
mov eax, 128
xor edx, edx
.L2:
mov rcx, rdx
mov rdx, rdi
imul rdx, rax
imul rcx, rsi
add rcx, rdx
mul rsi
add rdx, rcx
add rsi, 1
mov rcx, rsi
adc rdi, 0
xor rcx, 10000000
or rcx, rdi
jne .L2
mov rdi, r9
mov rsi, r8
ret
endproc_frame
global cmp2
proc_frame cmp2
[endprolog]
mov r8, rsi
mov r9, rdi
mov esi, 1
xor edi, edi
mov eax, 128
xor edx, edx
.L3:
mov rcx, rdi
imul rcx, rax
imul rdx, rsi
add rcx, rdx
mul rsi
add rdx, rcx
shrd rax, rdx, 2
sar rdx, 2
add rsi, 1
mov rcx, rsi
adc rdi, 0
xor rcx, 10000000
or rcx, rdi
jne .L3
mov rdi, r9
mov rsi, r8
ret
endproc_frame
Unless your processor can support native 128-bit operations, the operations will have to be software coded to use the next best option.
Your 128-bit operations are using the same scheme as the 8-bit processors did when using 16-bit operations, and this takes time.
For example, a 128-bit right shift, by one bit, using 64-bit registers requires:
Shift the Most Significant register right into carry. The Carry flag will contain the bit that was shifted out.
Shift the Least Significant register right, with carry. The bits will be shifted right, with the carry flag being shifted into the Most Significant Bit position.
Without support for native 128-bit operations, you code will take twice as many operations as the same 64-bit operations; sometimes more (such as multiplication). This is why you are seeing such poor performance.
I highly recommend only using 128-bits in places where it is extremely necessary.