Get bits from byte - c++

I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
return (bitResult >> (iStartPos + 1- iNumOfBites)) & ~(~0 << iNumOfBites);
}
The function returns group of bits from a byte.
i.e. if bitResult = 102 (01100110 in binary), iStartPos = 5, iNumOfBites = 3
Output: 2 (10 in binary)
For iStartPos = 7, iNumOfBites = 4
Output: 3 (0110 in binary)
I'm looking for a better / more "friendly" way to do that, e.g. with bitset or something like that. Any suggestions?

(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky in case len is the full width of the type, though (the corner case of extracting all the bits). Shifts by the full width of the type can either produce zero or leave the value unchanged. (It actually invokes undefined behaviour, but this is in practice what happens if the compiler can't see that at compile time. x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts.) With 32-bit ints:
If 1 << 32 produces 1, then 1-1 = 0, so the result will be zero.
If ~0 << 32 produces ~0, rather than 0, the mask will be zero.
Remember that 1<<len is undefined behaviour for len too large: unlike writing it as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
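One way (a sketch, not from the answer above) to make the all-bits corner case explicit is to branch so the shift count never reaches the type width:
#include <climits>
unsigned mask_low_bits(unsigned len)
{
if (len >= sizeof(unsigned) * CHAR_BIT)
return ~0u;             // all bits: avoid a full-width shift, which is UB
return (1u << len) - 1;     // safe: len is strictly less than the width
}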
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
return (input >> (msb-len + 1)) & ~(~0 << len);
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
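For comparison, a sketch of the LSB-based version described above (start is the position of the lowest bit of the field; the len == type-width corner case discussed earlier applies here too):
unsigned BitExtractLsb(unsigned input, int start, int len)
{
return (input >> start) & ~(~0u << len);
}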
There's also a machine instruction (introduced with Intel Haswell, and AMD Piledriver) that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB, len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the version that would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pextr for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range, if start and len are both compile time constants.
Only AMD implements a version of bextr that can have the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic, even with -march=bdver2 (i.e. Bulldozer v2 aka Piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr
// extract bits [msb : msb-len] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__ // probably also need to check for __GNUC__
return __builtin_ia32_bextr_u32(input, (len<<8) | (msb-len+1) );
#else
return (input >> (msb-len + 1)) & ~(~0 << len);
#endif
}
It would take an extra instruction (a movzx) to implement a (msb-len+1)&0xff safety check to keep the start byte from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either! -march=native on godbolt is Haswell, and thus includes BMI2 as well.)
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.

I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
* Arguments: const unsigned bitResult byte containing the bit field to extract
* const int iStartPos zero based offset from the least significant bit
* const int iNumOfBites number of bits to the right of the starting position
*
* Description: Extract a bitfield beginning at the specified position for the specified
* number of bits returning the extracted bit field right justified.
*/
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
// masks to remove any leading bits that need to disappear.
// we change starting position to be one based so the first element is unused.
const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
unsigned retVal = (bitResult & bitMasks[iStart]);
if (iStart > iNum) {
retVal >>= (iStart - iNum);
}
return retVal;
}

#include <cstdint>
#pragma pack(push, 1)
struct Bit
{
union
{
uint8_t _value;
struct {
uint8_t _bit0:1;
uint8_t _bit1:1;
uint8_t _bit2:1;
uint8_t _bit3:1;
uint8_t _bit4:1;
uint8_t _bit5:1;
uint8_t _bit6:1;
uint8_t _bit7:1;
};
};
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
union
{
uint32_t _value;
bit bytes[1]; // 1 for Single Byte
};
};
With a struct and union you can set the struct B _value to your result, then access bytes[0]._bit0 through bytes[0]._bit7 for each 0 or 1, and vice versa: set each bit, and the result will be in _value.
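For illustration, a hypothetical usage of these structs (keeping in mind that bitfield order within a byte is implementation-defined, and that reading a union member other than the one last written is technically undefined behaviour in ISO C++):
B b;
b._value = 102;               // 0b01100110
bool low = b.bytes[0]._bit1;  // read an individual bit
b.bytes[0]._bit7 = 1;         // set a bit; b._value reflects the change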

Related

Set the leading zero bits in any size integer C++

I want to set the leading zero bits in any size integer to 1 in standard C++.
eg.
0001 0011 0101 1111 -> 1111 0011 0101 1111
All the algorithms I've found to do this require a rather expensive leading-zero count. However, it seems odd: there are very fast and easy ways to do other types of bit manipulation, such as:
int y = -x & x; //Extracts lowest set bit, 1110 0101 -> 0000 0001
int y = (x + 1) & x; //Will clear the trailing ones, 1110 0101 - > 1110 0100
int y = (x - 1) | x; //Will set the trailing zeros, 0110 0100 - > 0110 0111
So that makes me think there must be a way to set the leading zeros of an integer in one simple line of code consisting of basic bitwise operators. Please tell me there's hope, because right now I'm settling for reversing the order of the bits in my integer, using the fast way of setting trailing zeros, and then reversing the integer again to get my leading zeros set to ones. That is actually significantly faster than using a leading-zero count, but still quite slow compared with the other algorithms above.
template<typename T>
inline constexpr void reverse(T& x)
{
T rev = 0;
size_t s = sizeof(T) * CHAR_BIT;
while(s > 0)
{
rev = (rev << 1) | (x & 0x01);
x >>= 1;
s -= 1uz;
}//End while
x = rev;
}
template<typename T>
inline constexpr void set_leading_zeros(T& x)
{
reverse(x);
x = (x - 1) | x;//Set trailing 0s to 1s
reverse(x);
}
Edit
Because some asked: I'm working with MS-DOS running on CPUs ranging from early X86 to a 486DX installed in older CNC machines.
Fun times. :D
The leading zeroes can be set without counting them, while also avoiding reversing the integer. For convenience I won't do it for a generic integer type T, but likely it can be adapted, or you could use template specialization.
First calculate the mask of all the bits that aren't the leading zeroes, by "spreading" the bits downwards:
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;
Then set all the bits that that mask doesn't cover:
return x | ~m;
Bonus: this automatically works even when x = 0 and when x has all bits set, either of which could lead to an overly large shift amount in a count-leading-zeros approach (which one is troublesome depends on the details, but almost always one of them is, since there are 65 distinct cases but only 64 valid shift amounts, if we're talking about uint64_t).
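Put together as a single function for uint64_t (a sketch; the answer above shows the same steps inline):
#include <cstdint>
uint64_t set_leading_zeros_u64(uint64_t x)
{
uint64_t m = x | (x >> 1);   // spread the set bits downwards
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;
return x | ~m;               // set everything the mask doesn't cover
}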
You could count leading zeroes using std::countl_zero and create a bitmask that your bitwise OR with the original value:
#include <bit>
#include <climits>
#include <type_traits>
template<class T>
requires std::is_unsigned_v<T>
T leading_ones(T v) {
auto lz = std::countl_zero(v);
return lz ? v | ~T{} << (CHAR_BIT * sizeof v - lz) : v;
}
If you have a std::uint16_t, like
0b0001001101011111
then ~T{} is 0b1111111111111111, CHAR_BIT * sizeof v is 16 and countl_zero(v) is 3. Left shift 0b1111111111111111 16-3 steps:
0b1110000000000000
Bitwise OR with the original:
0b0001001101011111
| 0b1110000000000000
--------------------
= 0b1111001101011111
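A quick driver for the function above (a hypothetical usage example, just to show the call):
#include <cstdint>
#include <cstdio>
int main() {
std::uint16_t v = 0b0001001101011111;
std::printf("%#x\n", (unsigned)leading_ones(v)); // prints 0xf35f
}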
Your reverse is extremely slow! With an N-bit int you need N iterations to reverse, each at least 6 instructions, then at least 2 instructions to set the trailing bits, and finally N iterations to reverse the value again. OTOH even the simplest leading zero count needs only N iterations, then set the leading bits directly:
template<typename T>
inline constexpr T trivial_ilog2(T x) // Slow, don't use this
{
if (x == 0) return 0;
size_t c{};
while(x)
{
x >>= 1;
c += 1u;
}
return c;
}
template<typename T>
inline constexpr T set_leading_zeros(T x)
{
if (std::make_unsigned_t<T>(x) >> (sizeof(T) * CHAR_BIT - 1)) // top bit is set
return x;
return x | (-T(1) << trivial_ilog2(x));
}
x = set_leading_zeros(x);
There are many other ways to count leading zeros / get the integer logarithm much faster. One of the methods involves doing it in steps of powers of 2, like how the mask is created in harold's answer:
What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?
Find the highest order bit in C
How to efficiently count the highest power of 2 that is less than or equal to a given number?
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup
But since you're targeting a specific platform instead of doing something cross-platform, and want to squeeze out every bit of performance, there's little reason to stick to pure standard features, since these use cases typically need platform-specific code. If intrinsics are available you should use them; for example, modern C++ has std::countl_zero, but each compiler also has intrinsics that map to the best instruction sequence for that platform, such as _BitScanReverse or __builtin_clz.
If intrinsics aren't available, or if the performance is still not enough, then try a lookup table. For example, here's a solution with a 256-element log table:
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
uint16_t lut_ilog2_16(uint16_t x)
{
uint8_t h = x >> 8;
if (h) return LogTable256[h] + 8;
else return LogTable256[x & 0xFF];
}
In set_leading_zeros just call lut_ilog2_16 like above
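For instance, a sketch of what that might look like for uint16_t (a hypothetical wrapper, not from the original answer; note lut_ilog2_16 returns the index of the highest set bit, so the shift is index + 1):
uint16_t set_leading_zeros_16(uint16_t x)
{
if (x & 0x8000) return x;   // top bit already set, nothing to do
if (x == 0)     return 0xFFFF;  // every bit was a leading zero
return x | (uint16_t)(0xFFFFu << (lut_ilog2_16(x) + 1));
}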
An even better solution than a log table is a mask table, so you can get the mask directly instead of calculating 1 << LogTable256[x]:
static const char MaskTable256[256] =
{
0xFF, 0xFE, 0xFC...
}
Some other notes:
1uz isn't a valid suffix in C++ prior to C++23
Don't use references for small types that fit in a single integer. That's not necessary and is usually slower when not inlined. Just assign the result back from the function
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0 so you need to branch on that.) For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts. But in C++, that's something for the compiler to figure out. Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
8086 to 286 - no instruction available.
386: bsf/bsr: 10+3n cycles. Worst-case: 10+3*31 = 103c
486: bsf (16 or 32-bit registers): 6-42 cycles; bsr 7-104 cycles (1 cycle less for 16-bit regs).
P5 Pentium: bsf: 6-42 cycles (6-34 for 16-bit); bsr 7-71 cycles. (or 7-39 for 16-bit). Non-pairable.
Intel P6 and later: bsf/bsr: 1 uop with 1 cycle throughput, 3 cycle latency. (PPro / PII and later).
AMD K7/K8/K10/Bulldozer/Zen: bsf/bsr are slowish for a modern CPU. e.g. K10 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
Intel Haswell / AMD K10 : lzcnt introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before tzcnt and the rest of BMI1).
For an input of 0, they return the operand-size, so they fully implement C++20 std::countl_zero / countr_zero respectively, unlike bsr/bsf. (Which leave the destination unmodified on input=0. AMD documents this, Intel implements it in practice on current CPUs at least, but documents the destination register as "undefined" contents. Perhaps some older Intel CPUs are different, otherwise it's just annoying that they don't document the behaviour so software can take advantage.)
On AMD, they're fast, single uop for lzcnt, with tzcnt taking one more (probably a bit-reverse to feed the lzcnt execution unit), so a nice win vs. bsf/bsr. This is why compilers typically use rep bsf for countr_zero / __builtin_ctz, so it will run as tzcnt on CPUs that support it, but as bsf on older CPUs. They produce the same results for non-zero inputs, unlike bsr/lzcnt.
On Intel, same fast performance as bsf/bsr, even including the output dependency until Skylake fixed that; it's a true dependency for bsf/bsr, but false dependency for tzcnt/lzcnt and popcnt.
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine. (Or maybe even moreso for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instructions timings for 8088 through Pentium. (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD. On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer). Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask. ISO C++ doesn't guarantee that >> is an arithmetic right shift on signed integer types, but all sane compilers choose that as their implementation-defined behaviour. (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available: like x | ~_bzhi_u32(-1, 32-lz). Unfortunately this requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs. Both versions are sketched below.
int set_leading_zeros(int x){
int lz = __builtin_clz(x|1); // clamp the lzcount to 31 at most
int tmp = (x<<lz); // shift out leading zeros, leaving a 1 (or 0 if x==0)
tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1); // set the MSB in case x==0
return tmp>>lz; // sign-extend with an arithmetic right shift.
}
#include <immintrin.h>
uint32_t set_leading_zeros_bmi2(uint32_t x){
int32_t lz = _lzcnt_u32(x); // returns 0 to 32
uint32_t mask = _bzhi_u32(-1, lz); // handles all 33 possible values, producing 0 for lz=32
return x | ~mask;
}
On x86-64 you can combine this with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend. Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, src / mov tmp, -1 / not ecx / shl tmp, cl / or dst, reg.

How to get char from first bits per byte in uint?

I have uint64_t variable with some value (for example 0x700a06fffff48517). I want to get char with the first bit of each byte in the uint (so from 0x700a06fffff48517 I want 0b00011110). Is there a better way than this?
#include <cstdint>
char getFirstBits(uint64_t x) {
x >>= 7; // shift to put first bits to last bits in byte
char c = 0;
for (size_t i = 0; i < 8; i++) {
c <<= 1;
c |= x & 1;
x >>= 8;
}
return c;
}
The fastest I can think of on (recent) x86 is
#include <immintrin.h>
uint8_t getFirstBits(uint64_t val) {
return _pext_u64(val, 0x8080808080808080ULL);
}
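A quick sanity check against the example value from the question (a hypothetical test; requires a BMI2-capable CPU and e.g. -mbmi2):
#include <cassert>
#include <cstdint>
#include <immintrin.h>
int main() {
uint64_t v = 0x700a06fffff48517ULL;
assert(_pext_u64(v, 0x8080808080808080ULL) == 0b00011110); // top bit of each byte
}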
This is a generic solution that doesn't depend on any particular CPU architecture:
char getFirstBits(uint64_t x) {
x = (ntohll(x) >> 7) & 0x0101010101010101; // get the first bits
return 0x8040201008040201*x >> 56; // move them together
}
This is basically the multiplication technique, where bits are moved around using a single multiplication with a magic number. The remaining bitwise operations remove the unnecessary bits. ntohll should be htobe64 on *nix. For more details about that technique and what the magic number means, read:
How to create a byte out of 8 bool values (and vice versa)?
What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?
You can also use SIMD to do it:
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
It found immintrin.h, but it cannot find _pext_u64 (it found _pext_u32); I guess that's because I'm on 32-bit Windows. However, when I use _pext_u32 to process both halves of the uint64, it crashes with an unknown-instruction error (it seems my processor doesn't have the instruction).
PEXT is a new instruction in the BMI2 extension, so if your CPU doesn't support BMI2 then you can't use it. In 32-bit mode only the 32-bit version of PEXT is available, which is why _pext_u64 doesn't work.
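If BMI2 is available but you're compiling for a 32-bit target, the 64-bit extract can be split into two 32-bit pext calls (a sketch, not from the original answers):
#include <immintrin.h>
#include <cstdint>
uint8_t getFirstBits32(uint64_t val) {
uint32_t lo = _pext_u32((uint32_t)val, 0x80808080u);         // top bits of bytes 0-3
uint32_t hi = _pext_u32((uint32_t)(val >> 32), 0x80808080u); // top bits of bytes 4-7
return (uint8_t)((hi << 4) | lo);
}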

Use bit manipulation to convert a bit from each byte in an 8-byte number to a single byte

I have a 64-bit unsigned integer. I want to check the 6th bit of each byte and return a single byte representing those 6th bits.
The obvious, "brute force" solution is:
inline const unsigned char Get6thBits(unsigned long long num) {
unsigned char byte(0);
for (int i = 7; i >= 0; --i) {
byte <<= 1;
byte |= bool((0x20ULL << 8 * i) & num); // ULL so the shift doesn't overflow a 32-bit int
}
return byte;
}
I could unroll the loop into a bunch of concatenated | statements to avoid the int allocation, but that's still pretty ugly.
Is there a faster, more clever way to do it? Maybe use a bitmask to get the 6th bits, 0x2020202020202020 and then somehow convert that to a byte?
If _pext_u64 is a possibility (this will work on Haswell and newer, it's very slow on Ryzen though), you could write this:
int extracted = _pext_u64(num, 0x2020202020202020);
This is a really literal way to implement it. pext takes a value (first argument) and a mask (second argument), at every position that the mask has a set bit it takes the corresponding bit from the value, and all bits are concatenated.
_mm_movemask_epi8 is more widely usable, you could use it like this:
__m128i n = _mm_set_epi64x(0, num);
int extracted = _mm_movemask_epi8(_mm_slli_epi64(n, 2));
pmovmskb takes the high bit of every byte in its input vector and concatenates them. The bits we want are not the high bit of every byte, so I move them up two positions with psllq (of course you could shift num directly). The _mm_set_epi64x is just some way to get num into a vector.
Don't forget to #include <intrin.h>, and none of this was tested.
Codegen seems reasonable enough
A weirder option is gathering the bits with a multiplication: (only slightly tested)
int extracted = (num & 0x2020202020202020) * 0x08102040810204 >> 56;
The idea here is that num & 0x2020202020202020 only has very few bits set, so we can arrange a product that never carries into bits that we need (or indeed at all). The multiplier is constructed to do this:
a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000 +
0b0000000c0000000d0000000e0000000f0000000g0000000h00000000000000 +
00c0000000d0000000e0000000f0000000g0000000h000000000000000000000 etc..
Then the top byte will have all the bits "compacted" together. The lower bytes actually have something like that too, but they're missing bits that would have to come from "higher" (bits can only move to the left in a multiplication).
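Wrapped up as a function (a sketch equivalent to the expression above; the ULL suffixes just make the operand widths explicit):
#include <cstdint>
uint8_t Get6thBits_mul(uint64_t num)
{
return (uint8_t)(((num & 0x2020202020202020ULL) * 0x08102040810204ULL) >> 56);
}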

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
#include <immintrin.h>
#include <cstdint>
uint32_t bitpack(const bool array[32]) {
__m256i tmp = _mm256_loadu_si256((const __m256i *) array);
tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
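For reference, here is roughly what that pair of 128-bit operations might look like (a hedged sketch, untested, again assuming sizeof(bool) == 1):
#include <emmintrin.h>
#include <cstdint>
uint32_t bitpack_sse2(const bool array[32]) {
__m128i lo = _mm_loadu_si128((const __m128i *) array);
__m128i hi = _mm_loadu_si128((const __m128i *) (array + 16));
lo = _mm_cmpgt_epi8(lo, _mm_setzero_si128()); // nonzero bytes -> 0xFF
hi = _mm_cmpgt_epi8(hi, _mm_setzero_si128());
return (uint32_t)_mm_movemask_epi8(lo) | ((uint32_t)_mm_movemask_epi8(hi) << 16);
}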
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here in a computer with fast multiplication like this
inline int pack8b(bool* a)
{
uint64_t t = *((uint64_t*)a);
return (0x8040201008040201*t >> 56) & 0xFF;
}
int pack32b(bool* a)
{
return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
(pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading them, we'll get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
| a7 || a6 || a5 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001, or 0x8040201008040201, we get the above code.
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, e.g. by shifting only once instead of shifting by 56 bits in each call.
Sorry, I overlooked the question: I saw doynax's bool array and misread "32 0/1 values" as 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register are constant-time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (summed over the 32 iterations) 1-bit shifts' worth of work, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time.)
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1<<m)+i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register is constant. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <cstdio>
unsigned a[32] =
{
1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
unsigned b = 0;
for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
b |= a[i] << i;
printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that well may end up as "the" fastest (using standard C, no processor dependent SSE or the likes):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing a proof-of-concept with some rough timings shows this is indeed not magnitudes better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

How to create a byte out of 8 bool values (and vice versa)?

I have 8 bool variables, and I want to "merge" them into a byte.
Is there an easy/preferred method to do this?
How about the other way around, decoding a byte into 8 separate boolean values?
I come in assuming it's not an unreasonable question, but since I couldn't find relevant documentation via Google, it's probably another one of those "nonono all your intuition is wrong" cases.
The hard way:
unsigned char ToByte(bool b[8])
{
unsigned char c = 0;
for (int i=0; i < 8; ++i)
if (b[i])
c |= 1 << i;
return c;
}
And:
void FromByte(unsigned char c, bool b[8])
{
for (int i=0; i < 8; ++i)
b[i] = (c & (1<<i)) != 0;
}
Or the cool way:
struct Bits
{
unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
union CBits
{
Bits bits;
unsigned char byte;
};
Then you can assign to one member of the union and read from another. But note that the order of the bits in Bits is implementation defined.
Note that reading one union member after writing another is well-defined in ISO C99, and as an extension in several major C++ implementations (including MSVC and GNU-compatible C++ compilers), but is Undefined Behaviour in ISO C++. memcpy or C++20 std::bit_cast are the safe ways to type-pun in portable C++.
(Also, the bit-order of bitfields within a char is implementation defined, as is possible padding between bitfield members.)
You might want to look into std::bitset. It allows you to compactly store booleans as bits, with all of the operators you would expect.
No point fooling around with bit-flipping and whatnot when you can abstract away.
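For example (a possible usage sketch, not from the original answer):
#include <bitset>
#include <cstdint>
uint8_t pack_with_bitset()
{
std::bitset<8> flags;       // 8 boolean "slots"
flags[0] = true;            // set individual bits
flags[7] = true;
uint8_t byte = static_cast<uint8_t>(flags.to_ulong()); // pack into a byte
std::bitset<8> back(byte);  // ...and unpack again
return byte;                // 0b10000001
}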
The cool way (using the multiplication technique)
#include <cstdint>
#include <cstring> // memcpy
inline uint8_t pack8bools(bool* a)
{
uint64_t t;
memcpy(&t, a, sizeof t); // strict-aliasing & alignment safe load
return 0x8040201008040201ULL*t >> 56;
// bit order: a[0]<<7 | a[1]<<6 | ... | a[7]<<0 on little-endian
// for a[0] => LSB, use 0x0102040810204080ULL on little-endian
}
void unpack8bools(uint8_t b, bool* a)
{
// on little-endian, a[0] = (b>>7) & 1 like printing order
auto MAGIC = 0x8040201008040201ULL; // for opposite order, byte-reverse this
auto MASK = 0x8080808080808080ULL;
uint64_t t = ((MAGIC*b) & MASK) >> 7;
memcpy(a, &t, sizeof t); // store 8 bytes without UB
}
Assuming sizeof(bool) == 1
To portably do LSB <-> a[0] (like the pext/pdep version below) instead of using the opposite of host endianness, use htole64(0x0102040810204080ULL) as the magic multiplier in both versions. (htole64 is from BSD / GNU <endian.h>). That arranges the multiplier bytes to match little-endian order for the bool array. htobe64 with the same constant gives the other order, MSB-first like you'd use for printing a number in base 2.
You may want to make sure that the bool array is 8-byte aligned (alignas(8)) for performance, and that the compiler knows this. memcpy is always safe for any alignment, but on ISAs that require alignment, a compiler can only inline memcpy as a single load or store instruction if it knows the pointer is sufficiently aligned. *(uint64_t*)a would promise alignment, but also violate the strict-aliasing rule. Even on ISAs that allow unaligned loads, they can be faster when naturally aligned. But the compiler can still inline memcpy without seeing that guarantee at compile time.
How they work
Suppose we have 8 bools b[0] to b[7] whose least significant bits are named a-h respectively, and we want to pack them into a single byte. Treating those 8 consecutive bools as one 64-bit word and loading them, we'll get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
| b7 || b6 || b5 || b4 || b3 || b2 || b1 || b0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So the magic number for packing would be 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201. If you're on a big endian machine you'll need to use the magic number 0x0102040810204080 which is calculated in a similar manner
For unpacking we can do a similar multiplication
| b7 || b6 || b5 || b4 || b3 || b2 || b1 || b0 |
abcdefgh
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
────────────────────────────────────────────────────────────────
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000
After multiplying we have the needed bits at the most significant positions, so we need to mask out the irrelevant bits and shift the remaining ones to the least significant positions. The output bytes will contain a to h, in little-endian order.
The efficient way
On newer x86 CPUs with BMI2 there are PEXT and PDEP instructions for this purpose. The pack8bools function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And the unpack8bools function can be implemented as
_pdep_u64(b, 0x0101010101010101ULL);
(This maps LSB -> LSB, like a 0x0102040810204080ULL multiplier constant, opposite of 0x8040201008040201ULL. x86 is little-endian: a[0] = (b>>0) & 1; after memcpy.)
Unfortunately those instructions are very slow on AMD before Zen 3, so you may need to compare with the multiplication method above to see which is better.
The other fast way is SSE2
x86 SIMD has an operation that takes the high bit of every byte (or float or double) in a vector register, and gives it to you as an integer. The instruction for bytes is pmovmskb. This can of course do 16 bytes at a time with the same number of instructions, so it gets better than the multiply trick if you have lots of this to do.
#include <immintrin.h>
inline uint8_t pack8bools_SSE2(const bool* a)
{
__m128i v = _mm_loadl_epi64( (const __m128i*)a ); // 8-byte load, despite the pointer type.
// __m128 v = _mm_cvtsi64_si128( uint64 ); // alternative if you already have an 8-byte integer
v = _mm_slli_epi32(v, 7); // low bit of each byte becomes the highest
return _mm_movemask_epi8(v);
}
There isn't a single instruction to unpack until AVX-512, which has mask-to-vector instructions. It is doable with SIMD, but likely not as efficiently as the multiply trick. See Convert 16 bits mask to 16 bytes mask and more generally is there an inverse instruction to the movemask instruction in intel avx2? for unpacking bitmaps to other element sizes.
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD has some answers specifically for 8-bits -> 8-bytes, but if you can't do 16 bits at a time for that direction, the multiply trick is probably better, and pext certainly is (except on CPUs where it's disastrously slow, like AMD before Zen 3).
#include <stdint.h> // to get the uint8_t type
uint8_t GetByteFromBools(const bool eightBools[8])
{
uint8_t ret = 0;
for (int i=0; i<8; i++) if (eightBools[i] == true) ret |= (1<<i);
return ret;
}
void DecodeByteIntoEightBools(uint8_t theByte, bool eightBools[8])
{
for (int i=0; i<8; i++) eightBools[i] = ((theByte & (1<<i)) != 0);
}
bool a,b,c,d,e,f,g,h;
//do stuff
char y= a<<7 | b<<6 | c<<5 | d<<4 | e <<3 | f<<2 | g<<1 | h;//merge
although you are probably better off using a bitset
http://www.cplusplus.com/reference/stl/bitset/bitset/
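And the other direction, decoding the byte back into bools (a sketch continuing the snippet above, not part of the original answer):
bool a2 = (y >> 7) & 1, b2 = (y >> 6) & 1, c2 = (y >> 5) & 1, d2 = (y >> 4) & 1;
bool e2 = (y >> 3) & 1, f2 = (y >> 2) & 1, g2 = (y >> 1) & 1, h2 = y & 1;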
I'd like to note that type punning through unions is UB in C++ (as rodrigo does in his answer). The safest way to do that is memcpy():
#include <cstring> // memcpy
struct Bits
{
unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
unsigned char toByte(Bits b){
unsigned char ret;
memcpy(&ret, &b, 1);
return ret;
}
As others have said, the compiler is smart enough to optimize out memcpy().
BTW, this is the way that Boost does type punning.
There is no way to pack 8 bool variables into one byte directly. There is, however, a way of packing 8 logical true/false states into a single byte using bitmasking.
You would use bitwise shift operations and casting to achieve it. A function could work like this:
unsigned char toByte(bool *bools)
{
unsigned char byte = 0;
for(int i = 0; i < 8; ++i) byte |= ((unsigned char) bools[i]) << i;
return byte;
}
Thanks Christian Rau for the corrections!