How to get char from first bits per byte in uint? - c++

I have uint64_t variable with some value (for example 0x700a06fffff48517). I want to get char with the first bit of each byte in the uint (so from 0x700a06fffff48517 I want 0b00011110). Is there a better way than this?
#include <cstddef>
#include <cstdint>
char getFirstBits(uint64_t x) {
x >>= 7; // shift to put first bits to last bits in byte
char c = 0;
for (size_t i = 0; i < 8; i++) {
c <<= 1;
c |= x & 1;
x >>= 8;
}
return c;
}

The fastest I can think of on (recent) x86 is
#include <immintrin.h>
uint8_t getFirstBits(uint64_t val) {
return _pext_u64(val, 0x8080808080808080ULL);
}

This is a generic solution that doesn't depend on any particular CPU architecture
char getFirstBits(uint64_t x) {
x = (ntohll(x) >> 7) & 0x0101010101010101; // get the first bits
return 0x8040201008040201*x >> 56; // move them together
}
This is basically the multiplication technique where bits are moved around using a single multiplication with a magic number. The remaining bitwise operations are for removing the unnecessary bits. ntohll should be htobe64 on *nix. For more details about that technique and what the magic number means read
How to create a byte out of 8 bool values (and vice versa)?
What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?
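As a rough illustration of the multiplication technique, here is a minimal sketch that works on the integer value itself, so it needs no byte swap. The function names are made up for this example, and the multiplier 0x0102040810204080 is chosen here (it is not the constant used in the answer above) so that the high bit of byte k lands in bit k of the result, which is the same order that _pext_u64 with the mask 0x8080808080808080 produces.

```cpp
#include <cstdint>

// Reference version: walk the bytes from most to least significant,
// so the high bit of byte k ends up in bit k of the result.
std::uint8_t first_bits_loop(std::uint64_t x) {
    std::uint8_t out = 0;
    for (int byte = 7; byte >= 0; --byte) {
        out = static_cast<std::uint8_t>((out << 1) | ((x >> (8 * byte + 7)) & 1));
    }
    return out;
}

// Multiplication version (hypothetical name, assumed multiplier).
std::uint8_t first_bits_mul(std::uint64_t x) {
    // Move the high bit of every byte down to that byte's bit 0.
    const std::uint64_t spread = (x >> 7) & 0x0101010101010101ULL;
    // Each set bit 8k is copied to bit 56 + k; no two copies collide,
    // so the product never carries into the top byte.
    return static_cast<std::uint8_t>((spread * 0x0102040810204080ULL) >> 56);
}
```

For the value from the question, first_bits_mul(0x700a06fffff48517) gives 0b00011110.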
You can also use SIMD to do it:
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
It found immintrin.h, but it cannot find _pext_u64 (it found _pext_u32), I guess it's because I'm on 32-bit windows. However, when I use _pext_u32 to process both halves of uint64, it crashes with unknown instruction (seems like my processor doesn't have the instruction).
PEXT is an instruction from the BMI2 extension, so you can't use it if your CPU doesn't support BMI2. In 32-bit mode only the 32-bit version of PEXT is supported, which is why _pext_u64 isn't found.

Related

C/C++ bit array resolution transform algorithms

Anyone aware of any algorithms to up/down convert bit arrays?
ie: when the resolution is 1/16:
every 1 bit = 16 bits. (low resolution to high resolution)
1010 -> 1111111111111111000000000000000011111111111111110000000000000000
and reverse, 16 bits = 1 bit (high resolution to low resolution)
1111111111111111000000000000000011111111111111110000000000000000 -> 1010
Right now I am looping bit by bit which is not efficient. Using a whole 64-bit word would be better but run into issues when the word isn't divisible by resolution equally (some bits may spill over to the next word).
C++:
std::vector<uint64_t> bitset;
C:
uint64_t *bitset = calloc((total_bits + 63) >> 6, sizeof(uint64_t)); // round up to whole words; free() when done
which is accessed using:
const uint64_t idx = bit >> 6;
const uint64_t pos = bit % 64;
const bool value = (bitset[idx] >> pos) & 1U;
and set/clear:
bitset[idx] |= (1ULL << pos);
bitset[idx] &= ~(1ULL << pos);
and the OR (or AND/XOR/NOT) of two bitsets of the same resolution is done using the full 64-bit word:
bitset[idx] |= source.bitset[idx];
I am dealing with large enough bitsets (2+ billion bits) that I'm looking for any efficiency in the loops. One way I found to optimize the loop is to check each word using __builtin_popcountll, and skip ahead in the loop:
for (uint64_t bit = 0; bit < total_bits; bit++)
{
const uint64_t idx = bit >> 6;
const uint64_t pos = bit % 64;
const uint64_t bits = __builtin_popcountll(bitset[idx]);
if (!bits)
{
bit += 63; // the whole word is empty: skip its remaining bits
continue;
}
// process
}
I'm looking for algorithms/techniques more than code examples. But if you have code to share, I won't say no. Any academic research papers would be appreciated too.
Thanks in advance!
Is the resolution always between 1/2 and 1/64? Or even 1/32? If you need very long runs, you might need more loop nesting, which might cause some slowdown.
Are your sequences always very long (millions of bits), or is that a maximum and your sequences are usually shorter? When going from high to low resolution, can you assume that the data is valid?
Here are some tricks:
uint64_t one = 1;
uint64_t n_one_bits = (one << n) - 1u; // valid for n from 0 to 63; n == 64 is undefined behaviour (shift by the full width)
If your sequences are that long, you might want to check whether n is a power of 2 and have more optimized code for those cases.
You might find some other useful tricks here:
https://graphics.stanford.edu/~seander/bithacks.html
So if your resolution is 1/16, you don't need to loop individual 16 bits but you can check all 16 bits at once. Then you can repeat for next group again and again.
If the resolution is not a divisor of 64, you can shift bits as appropriate each time you would cross a 64-bit boundary. Say your resolution is 1/5: you could process 60 bits, then shift the 4 remaining bits and combine them with the following 60 bits.
If you can assume that data is valid, then you don't even need to shift the original number as you can pick the value of the appropriate bit each time.
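To make the group-at-a-time idea concrete, here is a minimal sketch for the easy case where the resolution r divides 64, so no group straddles a word boundary. The names low_mask, expand_bits and collapse_bits are invented for this example, and treating a group as set when any of its bits is set is an assumption; the question does not say how mixed groups should be handled.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low n bits set, for 1 <= n <= 64 (never shifts by 64).
std::uint64_t low_mask(unsigned n) {
    return ~std::uint64_t{0} >> (64 - n);
}

// Low to high resolution: each of the low 64/r bits of `coarse`
// becomes a run of r identical bits in the result.
std::uint64_t expand_bits(std::uint64_t coarse, unsigned r) {
    assert(r >= 1 && r <= 64 && 64 % r == 0);
    std::uint64_t fine = 0;
    for (unsigned k = 0; k < 64 / r; ++k) {
        if ((coarse >> k) & 1) {
            fine |= low_mask(r) << (k * r);
        }
    }
    return fine;
}

// High to low resolution: each group of r bits collapses to one bit.
// Assumption: a group counts as 1 if any bit in it is set.
std::uint64_t collapse_bits(std::uint64_t fine, unsigned r) {
    assert(r >= 1 && r <= 64 && 64 % r == 0);
    std::uint64_t coarse = 0;
    for (unsigned k = 0; k < 64 / r; ++k) {
        if ((fine >> (k * r)) & low_mask(r)) {
            coarse |= std::uint64_t{1} << k;
        }
    }
    return coarse;
}
```

With r = 16 this reproduces the example from the question: 1010 expands to 0xFFFF0000FFFF0000 and collapses back to 1010. The loops touch one group per iteration instead of one bit, which is the saving described above.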

Use bit manipulation to convert a bit from each byte in an 8-byte number to a single byte

I have a 64-bit unsigned integer. I want to check the 6th bit of each byte and return a single byte representing those 6th bits.
The obvious, "brute force" solution is:
inline const unsigned char Get6thBits(unsigned long long num) {
unsigned char byte(0);
for (int i = 7; i >= 0; --i) {
byte <<= 1;
byte |= bool((0x20ULL << (8 * i)) & num); // 64-bit constant: shifting an int by up to 56 would overflow
}
return byte;
}
I could unroll the loop into a bunch of concatenated | statements to avoid the int allocation, but that's still pretty ugly.
Is there a faster, more clever way to do it? Maybe use a bitmask to get the 6th bits, 0x2020202020202020 and then somehow convert that to a byte?
If _pext_u64 is a possibility (it works on Haswell and newer; it is very slow on AMD Ryzen before Zen 3, though), you could write this:
int extracted = _pext_u64(num, 0x2020202020202020);
This is a really literal way to implement it. pext takes a value (first argument) and a mask (second argument). At every position where the mask has a set bit, it takes the corresponding bit from the value, and all those bits are concatenated.
_mm_movemask_epi8 is more widely usable, you could use it like this:
__m128i n = _mm_set_epi64x(0, num);
int extracted = _mm_movemask_epi8(_mm_slli_epi64(n, 2));
pmovmskb takes the high bit of every byte in its input vector and concatenates them. The bits we want are not the high bit of every byte, so I move them up two positions with psllq (of course you could shift num directly). The _mm_set_epi64x is just some way to get num into a vector.
Don't forget to #include <intrin.h> (MSVC) or <immintrin.h> (GCC/Clang), and none of this was tested.
Codegen seems reasonable enough
A weirder option is gathering the bits with a multiplication: (only slightly tested)
int extracted = (num & 0x2020202020202020) * 0x08102040810204 >> 56;
The idea here is that num & 0x2020202020202020 only has very few bits set, so we can arrange a product that never carries into bits that we need (or indeed at all). The multiplier is constructed to do this:
a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000 +
0b0000000c0000000d0000000e0000000f0000000g0000000h00000000000000 +
00c0000000d0000000e0000000f0000000g0000000h000000000000000000000 etc..
Then the top byte will have all the bits "compacted" together. The lower bytes actually have something like that too, but they're missing bits that would have to come from "higher" (bits can only move to the left in a multiplication).
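Since the multiplication variant is described as only slightly tested, a small sketch that compares it against a plain loop may be useful. The function names are made up for this example; the mask and multiplier are the ones from the expression above, written out with all 16 hex digits.

```cpp
#include <cstdint>

// Loop version: bit 5 of byte k ends up in bit k of the result.
std::uint8_t sixth_bits_loop(std::uint64_t num) {
    std::uint8_t out = 0;
    for (int i = 7; i >= 0; --i) {
        out = static_cast<std::uint8_t>((out << 1) | ((num >> (8 * i + 5)) & 1));
    }
    return out;
}

// Multiplication version: bit 8k+5 is moved to bit 56+k by the
// multiplier term 2^(51-7k); the eight terms never overlap.
std::uint8_t sixth_bits_mul(std::uint64_t num) {
    const std::uint64_t picked = num & 0x2020202020202020ULL;
    return static_cast<std::uint8_t>((picked * 0x0008102040810204ULL) >> 56);
}
```

Both functions should agree for every input, for example 0x2000000000000020 maps to 0x81 in either version.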

Get bits from byte

I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
return (bitResult >> (iStartPos + 1- iNumOfBites)) & ~(~0u << iNumOfBites);
}
The function returns group of bits from a byte.
e.g. if bitResult=102 (01100110)2, iStartPos=5, iNumOfBites=3
Output: 4 (100)2
For iStartPos=7, iNumOfBites=4
Output: 6 (0110)2
I'm looking for a better / more "friendly" way to do that, e.g. with bitset or something like that. Any suggestions?
(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky in case len is the full width of the type, though (the corner case of extracting all the bits). Shifts by the full width of the type can either produce zero or leave the value unchanged. (It actually invokes undefined behaviour, but this is what happens in practice if the compiler can't see the count at compile time. x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts.) With 32-bit ints:
If 1 << 32 produces 1, then 1-1 = 0, so the result will be zero.
If ~0 << 32 produces ~0, rather than 0, the mask will be zero.
Remember that 1<<len is undefined behaviour for len too large: unlike writing it as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
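One way to sidestep the full-width corner case is to special-case it when building the mask. This is only a sketch under the LSB/len convention used above; the name extract_bits and the 32-bit width are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Extract `len` bits of `src`, where `start` is the LSB of the range.
// The ternary avoids the undefined shift 1u << 32 when len == 32.
std::uint32_t extract_bits(std::uint32_t src, unsigned start, unsigned len) {
    assert(start < 32 && len <= 32 - start);
    const std::uint32_t mask =
        (len >= 32) ? ~std::uint32_t{0} : ((std::uint32_t{1} << len) - 1u);
    return (src >> start) & mask;
}
```

For example, extract_bits(102, 3, 3) yields 4, and extract_bits(x, 0, 32) returns x unchanged instead of depending on how the hardware treats an out-of-range shift count.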
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites + 1], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
return (input >> (msb-len + 1)) & ~(~0u << len);
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
There's also a machine instruction (introduced with Intel Haswell, and AMD Piledriver) that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB, len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the version that would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pext for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range, if start and len are both compile-time constants.
Only AMD implements a version of bextr that can have the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic, even with -march=bdver2 (i.e. bulldozer v2 aka piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__ // probably also need to check for __GNUC__
return __builtin_ia32_bextr_u32(input, (len<<8) | (msb-len+1) );
#else
return (input >> (msb-len + 1)) & ~(~0u << len);
#endif
}
It would take an extra instruction (a movzx) to implement a (msb-len+1)&0xff safety check to keep the start byte from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either! -march=native on godbolt is Haswell, and thus includes BMI2 as well.)
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.
I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
* Arguments: const unsigned bitResult byte containing the bit field to extract
* const int iStartPos zero based offset from the least significant bit
* const int iNumOfBites number of bits to the right of the starting position
*
* Description: Extract a bitfield beginning at the specified position for the specified
* number of bits returning the extracted bit field right justified.
*/
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
// masks to remove any leading bits that need to disappear.
// we change starting position to be one based so the first element is unused.
const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
unsigned retVal = (bitResult & bitMasks[iStart]);
if (iStart > iNum) {
retVal >>= (iStart - iNum);
}
return retVal;
}
#pragma pack(push, 1)
struct Bit
{
union
{
uint8_t _value;
struct {
uint8_t _bit0:1;
uint8_t _bit1:1;
uint8_t _bit2:1;
uint8_t _bit3:1;
uint8_t _bit4:1;
uint8_t _bit5:1;
uint8_t _bit6:1;
uint8_t _bit7:1;
};
};
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
union
{
uint32_t _value;
bit bytes[1]; // 1 for Single Byte
};
};
With a struct and union you can set struct B's _value to your result, then access bytes[0]._bit0 through bytes[0]._bit7 for each 0 or 1, and vice versa: set each bit, and the result will be in _value.

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that ((b >> i) & 1) == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
#include <immintrin.h>
#include <stdint.h>
uint32_t bitpack(const bool array[32]) {
__m256i tmp = _mm256_loadu_si256((const __m256i *) array);
tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
return _mm256_movemask_epi8(tmp);
}
This assumes sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
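The pair of 128-bit operations could look roughly like the sketch below. It is untested on real hardware and assumes sizeof(bool) == 1; the name bitpack32 is made up, and the scalar branch is only there so the example still builds on targets without SSE2.

```cpp
#include <cstdint>

static_assert(sizeof(bool) == 1, "sketch assumes one-byte bool");

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics

// Pack 32 bools into 32 bits, array[i] -> bit i, using two 16-byte halves.
std::uint32_t bitpack32(const bool* array) {
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array));
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array + 16));
    // Turn every non-zero byte into 0xFF so its high bit is set.
    lo = _mm_cmpgt_epi8(lo, zero);
    hi = _mm_cmpgt_epi8(hi, zero);
    // pmovmskb collects the high bit of each of the 16 bytes.
    const std::uint32_t lo_bits = static_cast<std::uint32_t>(_mm_movemask_epi8(lo));
    const std::uint32_t hi_bits = static_cast<std::uint32_t>(_mm_movemask_epi8(hi));
    return lo_bits | (hi_bits << 16);
}
#else
// Scalar fallback with the same bit order.
std::uint32_t bitpack32(const bool* array) {
    std::uint32_t out = 0;
    for (unsigned i = 0; i < 32; ++i) {
        out |= static_cast<std::uint32_t>(array[i]) << i;
    }
    return out;
}
#endif
```

The unaligned loads keep the sketch safe for any array address; with a 16-byte aligned array, _mm_load_si128 could be used instead.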
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here in a computer with fast multiplication like this
inline int pack8b(bool* a)
{
uint64_t t = *((uint64_t*)a);
return (0x8040201008040201*t >> 56) & 0xFF;
}
int pack32b(bool* a)
{
return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
(pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. If we treat those 8 consecutive bools as one 64-bit word and load it, we get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits)
  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, like shift only once instead of shifting left 56 bits
Sorry I overlooked the question and saw doynax's bool array as well as misread "32 0/1 values" and thought they're 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distribution of bits) at the same time but it's a lot less efficient than packing bytes
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register take constant time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts; it does 32 1-bit shifts which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (sum on 32) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time.)
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
((i1<<m)+i2)
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. Its pretty hard to avoid the fetches (if the original booleans are scattered around) and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
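The pairwise merging above can also be written as a loop over levels, which makes it easier to check against the simple shift-and-add loop. This is only an illustrative sketch: it reintroduces the loop overhead the unrolled version is meant to avoid, and the names pack_pairwise and pack_shift_loop are invented here.

```cpp
#include <cstdint>

// Merge neighbours level by level: 32 one-bit values, then 16 two-bit
// values, and so on. a[0] ends up in the most significant bit.
std::uint32_t pack_pairwise(const unsigned a[32]) {
    std::uint32_t v[32];
    for (unsigned k = 0; k < 32; ++k) {
        v[k] = a[k] & 1u;
    }
    for (unsigned width = 1, n = 32; n > 1; width *= 2, n /= 2) {
        for (unsigned k = 0; k < n / 2; ++k) {
            // v[k] is only overwritten after v[2k] and v[2k+1] were read.
            v[k] = (v[2 * k] << width) + v[2 * k + 1];
        }
    }
    return v[0];
}

// The first variant from the answer, for comparison.
std::uint32_t pack_shift_loop(const unsigned a[32]) {
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 32; ++i) {
        result = (result << 1) + (a[i] & 1u);
    }
    return result;
}
```

Both functions produce the same word, with a[0] in bit 31 and a[31] in bit 0.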
If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <stdio.h>
unsigned a[32] =
{
1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
unsigned b = 0;
for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
b |= a[i] << i;
printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that well may end up as "the" fastest (using standard C, no processor dependent SSE or the likes):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (unsigned int i = 0; i < 32; i++)
b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
A proof-of-concept test with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

How to create a byte out of 8 bool values (and vice versa)?

I have 8 bool variables, and I want to "merge" them into a byte.
Is there an easy/preferred method to do this?
How about the other way around, decoding a byte into 8 separate boolean values?
I come in assuming it's not an unreasonable question, but since I couldn't find relevant documentation via Google, it's probably another one of those "nonono all your intuition is wrong" cases.
The hard way:
unsigned char ToByte(bool b[8])
{
unsigned char c = 0;
for (int i=0; i < 8; ++i)
if (b[i])
c |= 1 << i;
return c;
}
And:
void FromByte(unsigned char c, bool b[8])
{
for (int i=0; i < 8; ++i)
b[i] = (c & (1<<i)) != 0;
}
Or the cool way:
struct Bits
{
unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
union CBits
{
Bits bits;
unsigned char byte;
};
Then you can assign to one member of the union and read from another. But note that the order of the bits in Bits is implementation defined.
Note that reading one union member after writing another is well-defined in ISO C99, and as an extension in several major C++ implementations (including MSVC and GNU-compatible C++ compilers), but is Undefined Behaviour in ISO C++. memcpy or C++20 std::bit_cast are the safe ways to type-pun in portable C++.
(Also, the bit-order of bitfields within a char is implementation defined, as is possible padding between bitfield members.)
You might want to look into std::bitset. It allows you to compactly store booleans as bits, with all of the operators you would expect.
No point fooling around with bit-flipping and whatnot when you can abstract away.
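A minimal sketch of the std::bitset route, assuming b[i] should map to bit i of the byte (the same order as ToByte above). The function names are made up for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Pack: b[i] becomes bit i of the returned byte.
std::uint8_t to_byte(const bool (&b)[8]) {
    std::bitset<8> bits;
    for (std::size_t i = 0; i < 8; ++i) {
        bits[i] = b[i];
    }
    return static_cast<std::uint8_t>(bits.to_ulong());
}

// Unpack: bit i of c is written to b[i].
void from_byte(std::uint8_t c, bool (&b)[8]) {
    const std::bitset<8> bits(c);
    for (std::size_t i = 0; i < 8; ++i) {
        b[i] = bits[i];
    }
}
```

If the eight flags can live in a std::bitset<8> from the start, the conversion loops disappear entirely and only to_ulong() is needed to get the byte.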
The cool way (using the multiplication technique)
inline uint8_t pack8bools(bool* a)
{
uint64_t t;
memcpy(&t, a, sizeof t); // strict-aliasing & alignment safe load
return 0x8040201008040201ULL*t >> 56;
// bit order: a[0]<<7 | a[1]<<6 | ... | a[7]<<0 on little-endian
// for a[0] => LSB, use 0x0102040810204080ULL on little-endian
}
void unpack8bools(uint8_t b, bool* a)
{
// on little-endian, a[0] = (b>>7) & 1 like printing order
auto MAGIC = 0x8040201008040201ULL; // for opposite order, byte-reverse this
auto MASK = 0x8080808080808080ULL;
uint64_t t = ((MAGIC*b) & MASK) >> 7;
memcpy(a, &t, sizeof t); // store 8 bytes without UB
}
Assuming sizeof(bool) == 1
To portably do LSB <-> a[0] (like the pext/pdep version below) instead of using the opposite of host endianness, use htole64(0x0102040810204080ULL) as the magic multiplier in both versions. (htole64 is from BSD / GNU <endian.h>). That arranges the multiplier bytes to match little-endian order for the bool array. htobe64 with the same constant gives the other order, MSB-first like you'd use for printing a number in base 2.
You may want to make sure that the bool array is 8-byte aligned (alignas(8)) for performance, and that the compiler knows this. memcpy is always safe for any alignment, but on ISAs that require alignment, a compiler can only inline memcpy as a single load or store instruction if it knows the pointer is sufficiently aligned. *(uint64_t*)a would promise alignment, but also violate the strict-aliasing rule. Even on ISAs that allow unaligned loads, they can be faster when naturally aligned. But the compiler can still inline memcpy without seeing that guarantee at compile time.
How they work
Suppose we have 8 bools b[0] to b[7] whose least significant bits are named a-h respectively that we want to pack into a single byte. If we treat those 8 consecutive bools as one 64-bit word and load it, we get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits)
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So the magic number for packing would be 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201. If you're on a big endian machine you'll need to use the magic number 0x0102040810204080 which is calculated in a similar manner
For unpacking we can do a similar multiplication
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                          abcdefgh
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
  ────────────────────────────────────────────────────────────────
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000
After multiplying we have the needed bits at the most significant positions, so we need to mask out the irrelevant bits and shift the remaining ones to the least significant positions. The output is 8 bytes containing a to h in little-endian order.
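A quick way to convince yourself that the two multiplications are inverses of each other is to round-trip every byte value. The sketch below repeats the pack and unpack steps with the same constants under shorter, made-up names (pack8, unpack8, roundtrip_all_bytes) and assumes sizeof(bool) == 1. The round trip should hold on either endianness, because both directions go through the same memcpy layout.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(bool) == 1, "sketch assumes one-byte bool");

// Pack 8 bools into a byte with the multiplier from the text.
std::uint8_t pack8(const bool* a) {
    std::uint64_t t;
    std::memcpy(&t, a, sizeof t);  // alignment- and aliasing-safe load
    return static_cast<std::uint8_t>((0x8040201008040201ULL * t) >> 56);
}

// Unpack a byte into 8 bools: multiply, mask the top bit of each
// byte, then shift it down to bit 0 so every byte is 0 or 1.
void unpack8(std::uint8_t b, bool* a) {
    const std::uint64_t t =
        ((0x8040201008040201ULL * b) & 0x8080808080808080ULL) >> 7;
    std::memcpy(a, &t, sizeof t);
}

// Returns true if unpack followed by pack reproduces all 256 values.
bool roundtrip_all_bytes() {
    for (unsigned v = 0; v < 256; ++v) {
        bool bits[8];
        unpack8(static_cast<std::uint8_t>(v), bits);
        if (pack8(bits) != v) {
            return false;
        }
    }
    return true;
}
```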
The efficient way
On newer x86 CPUs with BMI2 there are PEXT and PDEP instructions for this purpose. The pack8bools function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And the unpack8bools function can be implemented as
_pdep_u64(b, 0x0101010101010101ULL);
(This maps LSB -> LSB, like a 0x0102040810204080ULL multiplier constant, opposite of 0x8040201008040201ULL. x86 is little-endian: a[0] = (b>>0) & 1; after memcpy.)
Unfortunately those instructions are very slow on AMD before Zen 3 so you may need to compare with the multiplication method above to see which is better
The other fast way is SSE2
x86 SIMD has an operation that takes the high bit of every byte (or float or double) in a vector register, and gives it to you as an integer. The instruction for bytes is pmovmskb. This can of course do 16 bytes at a time with the same number of instructions, so it gets better than the multiply trick if you have lots of this to do.
#include <immintrin.h>
inline uint8_t pack8bools_SSE2(const bool* a)
{
__m128i v = _mm_loadl_epi64( (const __m128i*)a ); // 8-byte load, despite the pointer type.
// __m128i v = _mm_cvtsi64_si128( uint64 ); // alternative if you already have an 8-byte integer
v = _mm_slli_epi32(v, 7); // low bit of each byte becomes the highest
return _mm_movemask_epi8(v);
}
There isn't a single instruction to unpack until AVX-512, which has mask-to-vector instructions. It is doable with SIMD, but likely not as efficiently as the multiply trick. See Convert 16 bits mask to 16 bytes mask and more generally is there an inverse instruction to the movemask instruction in intel avx2? for unpacking bitmaps to other element sizes.
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD has some answers specifically for 8-bits -> 8-bytes, but if you can't do 16 bits at a time for that direction, the multiply trick is probably better, and pext certainly is (except on CPUs where it's disastrously slow, like AMD before Zen 3).
#include <stdint.h> // to get the uint8_t type
uint8_t GetByteFromBools(const bool eightBools[8])
{
uint8_t ret = 0;
for (int i=0; i<8; i++) if (eightBools[i] == true) ret |= (1<<i);
return ret;
}
void DecodeByteIntoEightBools(uint8_t theByte, bool eightBools[8])
{
for (int i=0; i<8; i++) eightBools[i] = ((theByte & (1<<i)) != 0);
}
bool a,b,c,d,e,f,g,h;
//do stuff
char y= a<<7 | b<<6 | c<<5 | d<<4 | e <<3 | f<<2 | g<<1 | h;//merge
although you are probably better off using a bitset
http://www.cplusplus.com/reference/stl/bitset/bitset/
I'd like to note that type punning through unions (as rodrigo does in his answer) is UB in C++. The safest way to do that is memcpy():
struct Bits
{
unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
unsigned char toByte(Bits b){
unsigned char ret;
memcpy(&ret, &b, 1);
return ret;
}
As others have said, the compiler is smart enough to optimize out memcpy().
BTW, this is the way that Boost does type punning.
There is no way to pack 8 bool variables into one byte. There is a way packing 8 logical true/false states in a single byte using Bitmasking.
You would use the bitwise shift operation and casting to achieve it. A function could work like this:
unsigned char toByte(bool *bools)
{
unsigned char byte = '\0';
for(int i = 0; i < 8; ++i) byte |= ((unsigned char) bools[i]) << i;
return byte;
}
Thanks Christian Rau for the corrections!