How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently?
I know GCC's built-in functions:
__builtin_clz, __builtin_clzl, __builtin_clzll
__builtin_ffs, __builtin_ffsl, __builtin_ffsll
However, these functions only work with 32- and 64-bit integers.
I also found the x86 lzcnt intrinsics:
__lzcnt16, __lzcnt, __lzcnt64
As you may guess, these only work with 16-, 32- and 64-bit integers.
Is there any similar, efficient built-in functionality for 128-bit integers?
inline int clz_u128 (uint128_t u) {
    uint64_t hi = u >> 64;
    uint64_t lo = u;
    int retval[3] = {
        __builtin_clzll(hi),
        __builtin_clzll(lo) + 64,
        128
    };
    int idx = !hi + ((!lo) & (!hi));
    return retval[idx];
}
This is a branch-free variant. Note that it does more work than the branchy solution, and in practice the branching will probably be predictable anyway.
It also relies on __builtin_clzll not crashing when fed 0: the docs say the result is undefined, but is that merely unspecified, or genuinely undefined behaviour?
Assuming a 'random' distribution, the first non-zero bit will be in the high 64 bits, with an overwhelming probability, so it makes sense to test that half first.
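For a quick sanity check, here is a small test of the three cases (not from the original answer; it assumes unsigned __int128 is available and typedef'd to uint128_t as in the question, and that __builtin_clzll(0) merely returns an unspecified value rather than faulting):
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;   /* assumption: GCC/Clang extension type */

/* ... clz_u128 as defined above ... */

int main (void) {
    assert(clz_u128(((uint128_t)1) << 127) == 0);   /* high half non-zero */
    assert(clz_u128((uint128_t)1) == 127);          /* only low half non-zero */
    assert(clz_u128((uint128_t)0) == 128);          /* both halves zero */
    return 0;
}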
Have a look at the code generated for:
/* inline */ int clz_u128 (uint128_t u)
{
    unsigned long long hi, lo; /* (or uint64_t) */
    int b = 128;

    if ((hi = u >> 64) != 0) {
        b = __builtin_clzll(hi);
    }
    else if ((lo = u & ~0ULL) != 0) {
        b = __builtin_clzll(lo) + 64;
    }

    return b;
}
I would expect gcc to implement each __builtin_clzll using the bsrq instruction - bit scan reverse, i.e., most-significant bit position - in conjunction with an xor, (msb ^ 63), or sub, (63 - msb), to turn it into a leading zero count. gcc might generate lzcnt instructions with the right -march= (architecture) options.
Edit: others have pointed out that the 'distribution' is not relevant in this case, since the HI uint64_t needs to be tested regardless.
Yakk's answer works well for all kinds of targets as long as gcc supports
128 bit integers for the target. However, note that on the x86-64 platform,
with an Intel Haswell processor or newer, there is a more efficient solution:
#include <immintrin.h>
#include <stdint.h>
// tested with compiler options: gcc -O3 -Wall -m64 -mlzcnt
inline int lzcnt_u128 (unsigned __int128 u) {
    uint64_t hi = u >> 64;
    uint64_t lo = u;
    lo = (hi == 0) ? lo : -1ULL;
    return _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
The _lzcnt_u64 intrinsic compiles (gcc 5.4) to the lzcnt instruction, which is well defined for a zero input (it returns 64), in contrast to gcc's __builtin_clzll().
The ternary operator compiles to the cmove instruction.
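For reference, a quick check of the boundary cases (not from the original answer; compiled with the same options as above):
#include <stdio.h>

/* ... lzcnt_u128 as defined above ... */

int main (void) {
    unsigned __int128 x = ((unsigned __int128)1) << 100;
    printf("%d\n", lzcnt_u128(x));                     // 27
    printf("%d\n", lzcnt_u128((unsigned __int128)0));  // 128, since lzcnt is defined for 0
    return 0;
}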
Related
Suppose we are trying to remove the trailing zeroes from some unsigned variable.
uint64_t a = ...
uint64_t last_bit = a & -a; // Two's complement trick: last_bit holds the trailing bit of a
a /= last_bit; // Removing all trailing zeroes from a.
I noticed that it's faster to manually count the bits and shift. (MSVC compiler with optimizations on)
uint64_t a = ...
uint64_t last_bit = a & -a;
unsigned long last_bit_index;
_BitScanForward64(&last_bit_index, last_bit);
a >>= last_bit_index;
Are there any further quick tricks that would make this even faster, assuming that the compiler intrinsic _BitScanForward64 is faster than any of the alternatives?
On x86, _tzcnt_u64 is a faster alternative to _BitScanForward64, if it is available (it is part of the BMI instruction set).
Also, you can use it directly on the input; you don't need to isolate the lowest set bit, as pointed out by @AlanBirtles in a comment.
Other than that, nothing can be done for a single variable. For an array of them, there may be a SIMD solution.
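A minimal sketch of the _tzcnt_u64 suggestion above (illustration only, assuming x86-64 with BMI1 and a != 0, since shifting a 64-bit value by 64 would be undefined):
#include <immintrin.h>
#include <stdint.h>

// Strip trailing zero bits directly; no need to isolate the lowest set bit first.
static inline uint64_t strip_trailing_zeros (uint64_t a) {
    return a >> _tzcnt_u64(a);   // _tzcnt_u64(0) is 64, so a must be non-zero here
}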
You can use std::countr_zero (c++20) and rely on the compiler to optimize it.
a >>= std::countr_zero(a);
(bonus: you don't need to specify the width and it works with any unsigned integer type)
I reached a bottleneck in my code, so the main issue of this question is performance.
I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:
bool starts_with (char* cksum_hex, int n_zero) {
    bool flag {true};
    for (int i = 0; i < n_zero; ++i)
        flag &= (cksum_hex[i] == '0');
    return flag;
}
The above function returns true if the cksum_hex has n_zero leading zeros. However, for my application, this function is very expensive (60% of total time). In other words, it is the bottleneck of my code. So I need to improve it.
I also checked std::string::starts_with which is available in C++20 and I observed no difference in performance:
// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000"); // checking for 3 leading zeros
For more information I am using g++ -O3 -std=c++2a and my gcc version is 9.3.1.
Questions
What is the fastest way of checking the leading characters in a char array?
Is there a more efficient way of doing it with std::string::starts_with?
Do bitwise operations help here?
If you modify your function to return early
bool starts_with (char* cksum_hex, int n_zero) {
    for (int i = 0; i < n_zero; ++i)
    {
        if (cksum_hex[i] != '0') return false;
    }
    return true;
}
it will be faster when n_zero is big and the result is false. Otherwise, you can try to allocate a global array of '0' characters and use std::memcmp:
#include <cstring>

// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};

bool starts_with (char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}
The problem here is that you need to assume some max possible value of n_zero.
=== EDIT ===
Considering the complaints about the lack of profiling data to justify the suggested approaches, here you go:
Benchmark results comparing early return implementation with memcmp implementation
Benchmark results comparing memcmp implementation with OP original implementation
Data used:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
memcmp is the fastest in all cases except cs2, where the early-return implementation wins.
Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0 rather than checking n bytes for equality to '0'.
e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.
I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.
bool high_zero_nibbles (uint64_t cksum_high8, int n_zero)
{
    int shift = 64 - n_zero * 4;        // A hex digit represents a 4-bit nibble
    return (cksum_high8 >> shift) == 0;
}
clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2
high_zero_nibbles(unsigned long, int):
        shl     esi, 2
        neg     sil                     # x86 shifts wrap the count so 64 - c is the same as -c
        shrx    rax, rdi, rsi           # BMI2 variable-count shifts save some uops.
        test    rax, rax
        sete    al
        ret
This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 to test none of the bits; it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-bounds shift counts, code-gen that worked for other shift counts would result in checking all 16 nibbles. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero=0 anyway.
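If n_zero = 0 does need to work, one possible workaround (not from the original answer) is to special-case it so the shift count never reaches 64:
bool high_zero_nibbles_safe (uint64_t cksum_high8, int n_zero) {
    // n_zero == 0 is trivially true; for 1..16 the shift count is 0..60, so no UB.
    return n_zero == 0 || (cksum_high8 >> (64 - n_zero * 4)) == 0;
}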
Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if that's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)
bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;        // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;   // true if the high n_zero nibbles are all zero
}
Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20 <bit>'s countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that only as compiler extensions like GNU C __builtin_clz.
This is great if you want to know how many and don't have one specific cutoff threshold.
bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
    // UB on cksum_high8 == 0. Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0
    return __builtin_clzll(cksum_high8) > 4*n_zero;
}

#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
    return std::countl_zero(cksum_high8) > 4*n_zero;
}
compile to (clang for Haswell):
high_zero_nibbles_lzcnt(unsigned long, int):
        lzcnt   rax, rdi
        shl     esi, 2
        cmp     esi, eax
        setl    al                      # FLAGS -> boolean integer return value
        ret
All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.
See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm. Same for both lzcnt ways with -march=haswell. Otherwise it needs to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.
To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
Using an SSE2 compare with pcmpeqb on the string, pmovmskb -> bsf could find the position of the first 1 bit, thus how many leading-'0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
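A sketch of that SIMD idea (illustration only, not from the answer; it assumes SSE2, at least 16 readable bytes at s, and uses __builtin_ctz in place of raw bsf):
#include <immintrin.h>

// Count leading '0' characters in the first 16 bytes of s.
static inline int leading_zero_chars16 (const char *s) {
    __m128i v    = _mm_loadu_si128((const __m128i*)s);
    __m128i zero = _mm_set1_epi8('0');
    unsigned eq  = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));  // bit i set if s[i] == '0'
    unsigned neq = ~eq & 0xFFFF;                                // bytes that are not '0'
    return neq ? __builtin_ctz(neq) : 16;                       // index of first non-'0'
}
The caller would then check whether the returned count is at least n_zero.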
You can make a buffer of zeros large enough for your needs, then compare with memcmp.
const char *zeroBuffer = "000000000000000000000000000000000000000000000000000";
if (memcmp(zeroBuffer, cksum_hex, n_zero) == 0) {
    // ...
}
Things you want to check to make your application faster:
1. Can the compiler inline this function in places where it is called?
Either declare the function as inline in a header or put the definition in the compile unit where it is used.
2. Not computing something is faster than computing something more efficiently
Are all calls to this function necessary? High cost is generally the sign of a function called inside a high-frequency loop or in an expensive algorithm. You can often reduce the call count, hence the time spent in the function, by optimizing the outer algorithm.
3. Is n_zero small or, even better, a constant?
Compilers are pretty good at optimizing algorithms for typically small constant values. If the constant is known to the compiler, it will most likely remove the loop entirely.
4. Does the bitwise operation help here?
It definitely has an effect and allows Clang (but not GCC, as far as I can tell) to do some vectorization. Vectorization tends to be faster, but that's not always the case depending on your hardware and the actual data processed.
Whether it is an optimization or not might depend on how big n_zero is. Considering you are processing checksums, it should be pretty small so it sounds like a potential optimization.
For a known n_zero, using the bitwise operation allows the compiler to remove all branching (a sketch follows below). I expect, though I did not measure, this to be faster.
std::all_of and std::string::starts_with should be compiled exactly as your implementation except they will use && instead of &.
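To illustrate points 3 and 4 with a sketch (an addition, not from the original answer): when n_zero is a compile-time constant, the branch-free & loop has a known trip count and the compiler can unroll or vectorize it away entirely:
template <int N>
bool starts_with_n (const char* cksum_hex) {
    bool flag = true;
    for (int i = 0; i < N; ++i)          // trip count known at compile time
        flag &= (cksum_hex[i] == '0');   // & keeps it branch-free
    return flag;
}
// usage: starts_with_n<3>(cksum_hex);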
Unless n_zero is quite high I would agree with others that you may be misinterpreting profiler results. But anyway:
Could the data be swapped to disk? If your system is under RAM pressure, data could be swapped out to disk and need to be loaded back to RAM when you perform the first operation on it. (Assuming this checksum check is the first access to the data in a while.)
Chances are you could use multiple threads/processes to take advantage of a multicore processor.
Maybe you could use statistics/correlation of your input data, or other structural features of your problem.
For instance, if you have a large number of digits (e.g. 50) and you know that the later digits have a higher probability of being nonzero, you can check the last one first.
If nearly all of your checksums should match, you can use [[likely]] to give a compiler hint that this is the case. (Probably won't make a difference but worth a try.)
Adding my two cents to this interesting discussion, though a little late to the game: you could use std::equal. It's a fast method with a slightly different approach, using a hardcoded string with the maximum number of zeros instead of the exact number of zeros.
It works by passing to the function pointers to the beginning and end of the string to be searched and of the string of zeros; specifically, the end iterators point one past the wanted number of zeros. These are used as iterators by std::equal:
Sample
#include <algorithm>
#include <iostream>

bool startsWith(const char* str, const char* end, const char* substr, const char* subend) {
    return std::equal(str, end, substr, subend);
}

int main() {
    const char* str = "000x1234567";
    const char* substr = "0000000000000000000000000000";
    std::cout << startsWith(&str[0], &str[3], &substr[0], &substr[3]);
}
Using the test cases in @pptaszni's good answer and the same testing conditions:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
The results were as follows:
Slower than using memcmp, but still faster (except for false results with a low number of zeros) and more consistent than your original code.
Use std::all_of
return std::all_of(chsum_hex, chsum_hex + n_zero, [](char c){ return c == '0'; });
I was curious to see whether or not MSVC used the compiler intrinsic __popcnt for bitset::count.
Looking around, I found this to be the implementation for std::bitset::count for VS2017:
size_t count() const _NOEXCEPT
{ // count number of set bits
const char *const _Bitsperbyte =
"\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\4\5\5\6\5\6\6\7\5\6\6\7\6\7\7\x8";
const unsigned char *_Ptr = &reinterpret_cast<const unsigned char&>(_Array);
const unsigned char *const _End = _Ptr + sizeof (_Array);
size_t _Val = 0;
for ( ; _Ptr != _End; ++_Ptr)
_Val += _Bitsperbyte[*_Ptr];
return (_Val);
}
It looks like it's using a lookup table to get the number of set bits for any given byte, and then sums that count over every byte.
According to this answer here, GCC implements it like this (along the lines of what I was thinking):
/// Returns the number of bits which are set.
size_t
count() const { return this->_M_do_count(); }

size_t
_M_do_count() const
{
    size_t __result = 0;
    for (size_t __i = 0; __i < _Nw; __i++)
        __result += __builtin_popcountl(_M_w[__i]);
    return __result;
}
Although I didn't benchmark anything, I would bet good money that GCC's implementation would be quite a bit faster here.
Hence, is there any compelling reason why MSVC implemented std::bitset::count like this? My guess is that either MSVC has a catch-all "no compiler intrinsics in STL" policy, or there's a difference between the two platforms that I'm overlooking.
The internal implementation of __builtin_popcountl in GCC is not better; depending on the architecture it is something like the following.
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
And only with the SSE4a instruction set (supported at first only in AMD CPUs, starting around 2006) does __builtin_popcountl compile to the single POPCNT assembler instruction.
MSDN says
Each of these intrinsics generates the popcnt instruction. The size of the value that the popcnt instruction returns is the same as the size of its argument. In 32-bit mode there are no 64-bit general-purpose registers, hence no 64-bit popcnt.
To determine hardware support for the popcnt instruction, call the __cpuid intrinsic with InfoType=0x00000001 and check bit 23 of CPUInfo[2] (ECX). This bit is 1 if the instruction is supported, and 0 otherwise. If you run code that uses this intrinsic on hardware that does not support the popcnt instruction, the results are unpredictable.
I assume the MSVC team did not want to use the intrinsic conditionally, in favor of one common solution independent of CPUs and architectures.
There's no "no compiler intrinsics in STL" policy, but there is a requirement that the STL run on all supported CPUs, and the minimum CPU is picked by the minimum requirements of the minimum OS version. Therefore, intrinsics that emit instructions for later CPUs can be used only if a fallback for older CPUs is provided.
Currently, C++20 std::popcount in VS 2019 uses runtime CPU detection and falls back to bit counting when popcnt is unavailable.
std::bitset::count could start using the same approach too. There's an issue in the STL GitHub repo for that, waiting for a maintainer or a contributor to implement it.
I start with three values A, B, C (unsigned 32-bit integers), and I have to obtain two values D, E (also unsigned 32-bit integers), where
D = high(A*C);
E = low(A*C) + high(B*C);
I expect that multiplying two 32-bit uints produces a 64-bit result. "high" and "low" are just my convention for marking the first 32 bits and the last 32 bits of the 64-bit multiply result.
I am trying to obtain optimized code from something already functional. I have a short part of the code inside a huge loop, just a few lines, yet it consumes almost all of the computation time (a physical simulation running for a couple of hours). That's the reason I am trying to optimize this little part; the rest of the code can remain more "user-well-arranged".
There are SSE instructions suited to computing the routine mentioned above. The gcc compiler probably does an optimized job on its own. However, I do not reject the option of writing some piece of code in SSE instructions directly, if necessary.
Please be patient with my limited SSE experience. I will try to write the algorithm for SSE just symbolically. There will probably be some mistakes with ordering, masks, or my understanding of the structure.
1. Store four 32-bit integers into one 128-bit register in the order A, B, C, C.
2. Apply an instruction (probably pmuludq) to that 128-bit register which multiplies pairs of 32-bit integers and returns pairs of 64-bit integers. It should thus calculate A*C and B*C simultaneously and return two 64-bit values.
3. I expect the 128-bit register now holds the values P,Q,R,S (four 32-bit blocks), where P,Q is the 64-bit result of A*C and R,S is the 64-bit result of B*C. Then I continue by rearranging the values in the register into the order P,Q,0,R.
4. Take the first 64 bits P,Q and add the second 64 bits 0,R. The result is a new 64-bit value.
5. Read the first 32 bits of the result as D and the last 32 bits as E.
This algorithm should return correct values for E and D.
My question:
Is there static C++ code that would generate an SSE routine similar to the 1-5 algorithm above? I prefer solutions with higher performance. If the algorithm is problematic for standard C++ constructs, is there a way to write it in SSE instructions directly?
I use TDM-GCC 4.9.2 64-bit compiler.
(note: Question was modified after advice)
(note2: I took inspiration from http://sci.tuomastonteri.fi/programming/sse on using SSE to obtain better performance)
You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.
They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.
Both of these functions compile to fairly good code (one mul or imul per multiply). gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.
That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.
I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)
When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)
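For reference, the intrinsic form of a full 64x64 -> 128-bit multiply via mulx looks roughly like this (a sketch, not from the answer; it requires BMI2, e.g. -mbmi2 or -march=haswell):
#include <immintrin.h>
#include <stdint.h>

// Returns the low 64 bits of a*b and stores the high 64 bits through hi.
static inline uint64_t mul64_full (uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned long long high;
    uint64_t lo = _mulx_u64(a, b, &high);
    *hi = high;
    return lo;
}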
// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>

struct DE { uint32_t D,E; };
struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
    uint64_t AC = A * (uint64_t)C;
    uint64_t BC = B * (uint64_t)C;
    uint32_t D = AC >> 32;         // high half
    uint32_t E = AC + (BC >> 32);  // We could cast to uint32_t before adding, but don't need to
    struct DE retval = { D, E };
    return retval;
}

#ifdef __SIZEOF_INT128__  // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D,E; };
struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
    __uint128_t AC = A * (__uint128_t)C;
    __uint128_t BC = B * (__uint128_t)C;
    uint64_t D = AC >> 64;         // high half
    uint64_t E = AC + (BC >> 64);
    struct DE64 retval = { D, E };
    return retval;
}
#endif
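A quick sanity check of f_structret with concrete values (an addition, not from the original answer):
#include <stdio.h>

/* ... struct DE and f_structret as defined above ... */

int main (void) {
    // A = B = 0x80000000, C = 4: A*C = B*C = 0x200000000,
    // so D = high(A*C) = 2 and E = low(A*C) + high(B*C) = 0 + 2 = 2.
    struct DE r = f_structret(0x80000000u, 0x80000000u, 4u);
    printf("D=%u E=%u\n", r.D, r.E);   // prints D=2 E=2
    return 0;
}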
If I understand it correctly, you want to compute the number of potential overflows in A*B. If so, you have two good options: "use a twice-as-big variable" (write 128-bit math functions for uint64 - it's not that hard, or wait for me to post them tomorrow), and "use a floating point type":
(float(A)*float(B))/float(C)
as the loss of precision is minimal (assuming float is 4 bytes, double 8 bytes, and long double 16 bytes long) , and both float and uint32 require 4 bytes of memory (use double for uint64_t as it should be 8 bytes long):
#include <iostream>
#include <conio.h>
#include <stdint.h>

using namespace std;

int main()
{
    uint32_t a(-1), b(-1);
    uint64_t result1;
    float result2;

    result1 = uint64_t(a)*uint64_t(b)/4294967296ull; // >>32 would be faster and less memory consuming
    result2 = float(a)*float(b)/4294967296.0f;

    cout.precision(20);
    cout << result1 << '\n' << result2;
    getch();
    return 0;
}
Produces:
4294967294
4294967296
But if you want a really precise and correct answer, I'd suggest using a twice-as-big type for the computation.
Now that I think of it, you could use long double for uint64 and double for uint32 instead of writing functions for uint64, but I don't think it's guaranteed that long double will be 128 bit, so you'd have to check that. I'd go for the more universal option.
EDIT:
You can write a function to calculate this without using anything more
than A, B and a result variable of the same type as A. Just add the
rightmost bit of Z (where Z equals B*((A>>pass_number)&1)) for Z<<0,
Z<<1, Z<<2 (...) Z<<X in the first pass, and Z<<-1, Z<<0, Z<<1 (...) Z<<(X-1)
in the second (there should be X passes), while right-shifting the result
by 1 each time (the just-computed bit becomes irrelevant to us after it's
computed, as it won't participate in the calculation anymore, and it would
be erased anyway after dividing by 2^X, i.e. doing >>X).
(I had to place this in the "code" block as I'm new here and couldn't find another way to prevent the formatting script from eating half of it.)
It's just a quick idea. You'll have to check its correctness (sorry, but I'm really tired right now) - the result shouldn't overflow at any point of the calculation, as the maximum carry would have a value of 2^X if I'm correct, and the algorithm itself seems sound.
I will write code for that tomorrow if you'll still be in need of help.
The Haswell architecture comes with several new instructions. One of them is PEXT (parallel bits extract), whose functionality is explained by this image (source here):
It takes a value r2 and a mask r3 and puts the extracted bits of r2 into r1.
My question is the following: what would be the equivalent code - an optimized templated function in pure standard C++11 - that compilers would be likely to optimize to this instruction in the future?
Here is some code from Matthew Fioravante's stdcxx-bitops GitHub repo that was floated on the std-proposals mailing list as a preliminary proposal to add a constexpr bitwise operations library for C++.
#ifndef HAS_CXX14_CONSTEXPR
#define HAS_CXX14_CONSTEXPR 0
#endif

#if HAS_CXX14_CONSTEXPR
#define constexpr14 constexpr
#else
#define constexpr14
#endif

//Parallel Bits Extract
//x    HGFEDCBA
//mask 01100100
//res  00000GFC
//x86_64 BMI2: PEXT
template <typename Integral>
constexpr14 Integral extract_bits(Integral x, Integral mask) {
    Integral res = 0;
    for (Integral bb = 1; mask != 0; bb += bb) {
        if (x & mask & -mask) {
            res |= bb;
        }
        mask &= (mask - 1);
    }
    return res;
}
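A quick usage check matching the comment above (an addition, using the extract_bits template just shown):
#include <cassert>
#include <cstdint>

int main () {
    // x = 0xB5 = 10110101 (HGFEDCBA), mask = 0x64 = 01100100 selects bits C, F, G:
    // C = 1, F = 1, G = 0, so the packed result is 011 = 0x3.
    assert(extract_bits<uint8_t>(0xB5, 0x64) == 0x3);
}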
As of August 2022, there still does not seem to be a way to write code for which a compiler will generate a PEXT instruction.
However, at this point (C++20), you can call many functions in <bit> that wrap assembly code.
In general, you can get a compiler to generate assembly instructions, where supported, for all functions in TBM/BMI and for BZHI from BMI2, which all can be described with simple expressions.
PEXT is non-trivial and has the same complexity as PDEP, with multiple possible implementations. So, neither seems straightforward for an optimizer to recognize.
The ABM instructions, POPCNT and LZCNT, are also non-trivial, and implementations could not be recognized. Fortunately, we have std::popcount and std::countl_zero, respectively, which can map directly to the corresponding instruction (where the hardware supports it).
It would seem that PEXT and PDEP will likely be similarly supported before the time compilers can infer that an instruction can replace the algorithm.
Now that C++ is officially two's complement, it would be nice to see both arithmetic and logical shift right wrapped, so that one could explicitly use them on signed or unsigned types as required.
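A sketch of what such wrappers might look like (illustration only, not an existing library API):
#include <cstdint>
#include <type_traits>

// Logical shift right: always shifts in zero bits, regardless of T's signedness.
template <typename T>
constexpr T shift_right_logical(T x, int n) {
    using U = std::make_unsigned_t<T>;
    return static_cast<T>(static_cast<U>(x) >> n);
}

// Arithmetic shift right: always replicates the sign bit (well-defined since C++20).
template <typename T>
constexpr T shift_right_arithmetic(T x, int n) {
    using S = std::make_signed_t<T>;
    return static_cast<T>(static_cast<S>(x) >> n);
}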
As far as PEXT implementations go, here's a variation that might compare favorably to TemplateRex's (Matthew Fioravante's) version at https://stackoverflow.com/a/21159523/2963099
template <typename Integral>
constexpr Integral extract_bits(Integral x, Integral mask) {
    Integral res = 0;
    int bb = 1;
    do {
        Integral lsb = mask & -mask;
        mask &= ~lsb;
        bool isset = x & lsb;
        res |= isset ? bb : 0;
        bb += bb;
    } while (mask);
    return res;
}
You can compare them both on Compiler Explorer at https://godbolt.org/z/3h9WrYqxT
Mostly this breaks out the least significant bit (lsb) to remove it from mask and test against x.
It is safe to run through once with mask == 0 (lsb will be 0, so isset is false). Using a do-while is much more efficient, except for that trivial case.
Using the ternary operator is mostly stylistic since, to me, it is a stronger hint of the intent to generate a cmov.