Use bit manipulation to convert a bit from each byte in an 8-byte number to a single byte - c++

I have a 64-bit unsigned integer. I want to check the 6th bit of each byte and return a single byte representing those 6th bits.
The obvious, "brute force" solution is:
inline unsigned char Get6thBits(unsigned long long num) {
    unsigned char byte(0);
    for (int i = 7; i >= 0; --i) {
        byte <<= 1;
        byte |= bool((0x20ULL << 8 * i) & num);  // ULL: shifting a 32-bit int left by up to 56 bits would be undefined
    }
    return byte;
}
I could unroll the loop into a bunch of concatenated | statements to avoid the int allocation, but that's still pretty ugly.
Is there a faster, more clever way to do it? Maybe use a bitmask to get the 6th bits, 0x2020202020202020 and then somehow convert that to a byte?

If _pext_u64 is a possibility (it requires BMI2, so Haswell or newer on Intel, and it's very slow on AMD before Zen 3), you could write this:
int extracted = _pext_u64(num, 0x2020202020202020);
This is a really literal way to implement it: pext takes a value (first argument) and a mask (second argument); at every position where the mask has a set bit, it takes the corresponding bit from the value, and all the extracted bits are concatenated.
_mm_movemask_epi8 is more widely usable, you could use it like this:
__m128i n = _mm_set_epi64x(0, num);
int extracted = _mm_movemask_epi8(_mm_slli_epi64(n, 2));
pmovmskb takes the high bit of every byte in its input vector and concatenates them. The bits we want are not the high bit of every byte, so I move them up two positions with psllq (of course you could shift num directly). The _mm_set_epi64x is just some way to get num into a vector.
Don't forget to #include <intrin.h> (or <immintrin.h> on GCC/Clang), and note that none of this was tested.
Codegen seems reasonable enough
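For reference, here is a self-contained sketch (my addition, same logic as the snippets above) that selects between the two variants using GCC/Clang's compile-time __BMI2__ macro; the bit order matches the loop in the question, with byte 7's bit ending up in the MSB:

#include <immintrin.h>  // _pext_u64, _mm_set_epi64x, _mm_slli_epi64, _mm_movemask_epi8

unsigned char Get6thBits_intrin(unsigned long long num) {
#ifdef __BMI2__
    // pext gathers the masked bits; the lowest mask bit maps to result bit 0
    return (unsigned char)_pext_u64(num, 0x2020202020202020ULL);
#else
    // SSE2 fallback: move bit 5 of each byte up to bit 7, then movemask
    __m128i n = _mm_set_epi64x(0, num);
    return (unsigned char)_mm_movemask_epi8(_mm_slli_epi64(n, 2));
#endif
}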
A weirder option is gathering the bits with a multiplication: (only slightly tested)
int extracted = (num & 0x2020202020202020) * 0x08102040810204 >> 56;
The idea here is that num & 0x2020202020202020 only has very few bits set, so we can arrange a product that never carries into bits that we need (or indeed at all). The multiplier is constructed to do this:
a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000 +
0b0000000c0000000d0000000e0000000f0000000g0000000h00000000000000 +
00c0000000d0000000e0000000f0000000g0000000h000000000000000000000 etc..
Then the top byte will have all the bits "compacted" together. The lower bytes actually have something like that too, but they're missing bits that would have to come from "higher" (bits can only move to the left in a multiplication).
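To make the carry-free claim easy to check, here is a small test harness (my addition, not from the original answer) comparing the multiply version against the brute-force loop from the question; the constant is the same one as above, written out with its leading zeros:

#include <cassert>
#include <cstdio>

unsigned char Get6thBits_mul(unsigned long long num) {
    return (unsigned char)(((num & 0x2020202020202020ULL) * 0x0008102040810204ULL) >> 56);
}

int main() {
    const unsigned long long tests[] = {0, ~0ULL, 0x2020202020202020ULL, 0x123456789abcdef0ULL};
    for (unsigned long long num : tests) {
        unsigned char byte = 0;
        for (int i = 7; i >= 0; --i) {  // reference: the loop from the question
            byte <<= 1;
            byte |= (unsigned char)((num >> (8 * i + 5)) & 1);
        }
        assert(Get6thBits_mul(num) == byte);
    }
    std::puts("ok");
}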

Related

How to get char from first bits per byte in uint?

I have uint64_t variable with some value (for example 0x700a06fffff48517). I want to get char with the first bit of each byte in the uint (so from 0x700a06fffff48517 I want 0b00011110). Is there a better way than this?
#include <cstdint>  // <inttypes> is not a standard header; <cstdint> provides uint64_t
char getFirstBits(uint64_t x) {
    x >>= 7; // shift to put each byte's first bit at that byte's last position
    char c = 0;
    for (int i = 0; i < 8; i++) {
        c |= (x & 1) << i;  // byte i's bit goes to result bit i, so the top byte lands in the MSB as desired
        x >>= 8;
    }
    return c;
}
The fastest I can think of on (recent) x86 is
#include <immintrin.h>
uint8_t getFirstBits(uint64_t val) {
    return _pext_u64(val, 0x8080808080808080ULL);
}
This is a generic solution that doesn't depend on any particular CPU architecture:
char getFirstBits(uint64_t x) {
    x = (ntohll(x) >> 7) & 0x0101010101010101;  // get the first bits
    return (0x8040201008040201 * x) >> 56;      // move them together
}
This is basically the multiplication technique, where bits are moved around using a single multiplication with a magic number; the remaining bitwise operations remove the unnecessary bits. ntohll should be htobe64 on *nix. Note that the byte swap makes this depend on a little-endian host to produce the bit order asked for. For more details about that technique and what the magic number means, read:
How to create a byte out of 8 bool values (and vice versa)?
What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?
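As an aside (my variant, not from the original answer): the byte swap can be avoided entirely by mirroring the magic multiplier, which routes byte k's top bit to result bit k with pure value arithmetic, independent of host endianness. For 0x700a06fffff48517 this also yields 0b00011110:

#include <cstdint>

char getFirstBitsNoSwap(uint64_t x) {
    x = (x >> 7) & 0x0101010101010101ULL;  // isolate each byte's top bit
    // 0x0102040810204080 = sum of 2^(56 - 7k); every partial product lands on a
    // distinct bit position, so nothing carries
    return (char)((x * 0x0102040810204080ULL) >> 56);
}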
You can also use SIMD to do it:
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
It finds immintrin.h, but it cannot find _pext_u64 (only _pext_u32); I guess that's because I'm on 32-bit Windows. However, when I use _pext_u32 to process both halves of the uint64, it crashes with an illegal instruction (it seems my processor doesn't have the instruction).
PEXT is a new instruction in the BMI2 extension, so if your CPU doesn't support BMI2 then you can't use it. In 32-bit mode only the 32-bit version of PEXT is available; that's why _pext_u64 doesn't work.
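For completeness, a runtime-dispatch sketch (my illustration; GCC/Clang on a 64-bit target, hypothetical function names): detect BMI2 once, and fall back to the multiply variant shown above instead of faulting on an unsupported instruction:

#include <cstdint>
#include <immintrin.h>

__attribute__((target("bmi2")))  // lets GCC/Clang emit pext in this one function
static uint8_t firstBitsPext(uint64_t x) {
    return (uint8_t)_pext_u64(x, 0x8080808080808080ULL);
}

static uint8_t firstBitsPortable(uint64_t x) {
    x = (x >> 7) & 0x0101010101010101ULL;
    return (uint8_t)((x * 0x0102040810204080ULL) >> 56);
}

uint8_t firstBits(uint64_t x) {
    static const bool haveBmi2 = __builtin_cpu_supports("bmi2");
    return haveBmi2 ? firstBitsPext(x) : firstBitsPortable(x);
}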

C/C++ bit array resolution transform algorithms

Anyone aware of any algorithms to up/down convert bit arrays?
ie: when the resolution is 1/16:
every 1 bit = 16 bits. (low resolution to high resolution)
1010 -> 1111111111111111000000000000000011111111111111110000000000000000
and reverse, 16 bits = 1 bit (high resolution to low resolution)
1111111111111111000000000000000011111111111111110000000000000000 -> 1010
Right now I am looping bit by bit, which is not efficient. Using a whole 64-bit word would be better, but I run into issues when 64 isn't evenly divisible by the resolution (some bits may spill over into the next word).
C++:
std::vector<uint64_t> bitset;
C:
uint64_t *bitset = calloc(total_bits >> 6, sizeof(uint64_t)); // free() when done
which is accessed using:
const uint64_t idx = bit >> 6;
const uint64_t pos = bit % 64;
const bool value = (bitset[idx] >> pos) & 1U;
and set/clear:
bitset[idx] |= (1ULL << pos);   // 1ULL rather than 1UL: pos can reach 63, and unsigned long may be only 32 bits
bitset[idx] &= ~(1ULL << pos);
and the OR (or AND/XOR/NOT) of two bitsets of the same resolution is done using the full 64-bit word:
bitset[idx] |= source.bitset[idx];
I am dealing with large enough bitsets (2+ billion bits) that I'm looking for any efficiency in the loops. One way I found to optimize the loop is to check each word using __builtin_popcountll, and skip ahead in the loop:
for (uint64_t bit = 0; bit < total_bits; bit++)
{
    const uint64_t idx = bit >> 6;
    const uint64_t pos = bit % 64;
    if (pos == 0 && __builtin_popcountll(bitset[idx]) == 0)  // a plain bitset[idx] == 0 test is equivalent and cheaper
    {
        bit += 63; // the whole word is zero: skip to its last bit, and the loop increment moves past it
        continue;
    }
    // process
}
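An alternative sketch (my addition; the question does say code is welcome): iterate word by word and use __builtin_ctzll to jump directly to each set bit, so runs of zero bits are never visited at all:

#include <stdint.h>

void forEachSetBit(const uint64_t *bitset, uint64_t total_words)
{
    for (uint64_t idx = 0; idx < total_words; idx++)
    {
        uint64_t w = bitset[idx];
        while (w)
        {
            const uint64_t bit = idx * 64 + (uint64_t)__builtin_ctzll(w);
            // process `bit` here
            w &= w - 1; // clear the lowest set bit
        }
    }
}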
I'm looking for algorithms/techniques more than code examples. But if you have code to share, I won't say no. Any academic research papers would be appreciated too.
Thanks in advance!
Is the resolution always between 1/2 and 1/64? Or even 1/32? Because if you need very long runs, you might need more loop nesting, which might cause some slowdown.
Are your sequences always very long (millions of bits), or is that a maximum while usually they are shorter? When converting from high to low resolution, can you assume the data is valid or not?
Here are some tricks:
uint64_t one = 1;
uint64_t n_one_bits = (one << n) - 1u; // valid for n = 0 to 63; for n == 64 the shift is undefined behaviour, so use ~0ULL instead
If your sequences are that long, you might want to check whether n is a power of 2 and have more optimized code for those cases.
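A classic test for that (one of the tricks on the bit-hacks page linked below):

#include <cstdint>

inline bool isPowerOfTwo(uint64_t n) {
    return n != 0 && (n & (n - 1)) == 0;  // n & (n - 1) clears the lowest set bit; only powers of two become zero
}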
You might find some other useful tricks here:
https://graphics.stanford.edu/~seander/bithacks.html
So if your resolution is 1/16, you don't need to loop over individual bits: you can check all 16 bits at once, then repeat for the next group, again and again.
If the resolution doesn't divide 64 evenly, you can shift bits as appropriate each time you would cross a 64-bit boundary. Say your resolution is 1/5: you could process 60 bits, then shift the 4 remaining bits and combine them with the following 60.
If you can assume the data is valid, you don't even need to shift the original number, as you can pick the value of the appropriate bit each time.
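For the 1/16 example, a minimal sketch of both directions (my addition), processing one 64-bit output word, i.e. 4 input bits, at a time; compact4 assumes valid data, meaning each 16-bit group is all-zeros or all-ones:

#include <cstdint>

// Low to high: expand 4 input bits into one 64-bit word, each bit repeated 16 times.
uint64_t expand4(unsigned nibble)
{
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        if (nibble & (1u << i))
            w |= 0xFFFFULL << (16 * i);
    return w;
}

// High to low: one output bit per 16-bit group; testing a single bit of the group
// is enough because the data is assumed valid.
unsigned compact4(uint64_t w)
{
    unsigned nibble = 0;
    for (int i = 0; i < 4; i++)
        nibble |= (unsigned)((w >> (16 * i)) & 1) << i;
    return nibble;
}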

How to build N bits variables in C++?

I am dealing with a very large list of booleans in C++, around 2^N items of N booleans each. Because memory is critical in such a situation, i.e. with exponential growth, I would like to build an N-bit-long variable to store each element.
For small N, for example 24, I just use unsigned long int. It takes 64MB ((2^24)*32/8/1024/1024). But I need to go up to 36. The only option with a built-in type is unsigned long long int, but that takes 512GB ((2^36)*64/8/1024/1024/1024), which is a bit too much.
With a 36-bit variable it would work for me, because the size drops to 288GB ((2^36)*36/8/1024/1024/1024), which fits on a node of my supercomputer.
I tried std::bitset, but std::bitset<N> creates an element of at least 8B.
So a list of std::bitset<1> is much larger than a list of unsigned long int.
It is because std::bitset just changes the representation, not the container.
I also tried boost::dynamic_bitset<> from Boost, but the result is even worse (at least 32B!), for the same reason.
I know an option is to write all elements as one chain of booleans, 2473901162496 bits (2^36*36), and to store them in 38654705664 (2473901162496/64) unsigned long long ints, which gives 288GB (38654705664*64/8/1024/1024/1024). Accessing an element is then just a game of finding in which words the 36 bits are stored (it can be either one or two). But that means a lot of rewriting of the existing code (3000 lines), because mapping becomes impossible and because adding and deleting items during execution in some functions would surely be complicated, confusing and challenging, and the result would most likely not be efficient.
How to build a N-bits variable in C++?
How about a struct with 5 chars (and perhaps some fancy operator overloading as needed to keep it compatible with the existing code)? A struct with a long and a char probably won't work because of padding / alignment...
Basically your own mini BitSet optimized for size:
struct Bitset40 {
    unsigned char data[5];
    bool getBit(int index) const {
        return (data[index / 8] & (1 << (index % 8))) != 0;
    }
    void setBit(int index, bool newVal) {
        if (newVal) {
            data[index / 8] |= (1 << (index % 8));
        } else {
            data[index / 8] &= ~(1 << (index % 8));
        }
    }
};
Edit: As geza has also pointed out in the comments, the "trick" here is to get as close as possible to the minimum number of bytes needed (without wasting memory by triggering alignment losses, padding or pointer indirection, see http://www.catb.org/esr/structure-packing/).
Edit 2: If you feel adventurous, you could also try a bit field (and please let us know how much space it actually consumes):
struct Bitset36 {
    unsigned long long data : 36;
};
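A quick way to check (my addition): print the size. On typical x86-64 ABIs this prints 8, because the struct is padded up to the alignment of unsigned long long, so the bit field saves nothing on its own:

#include <cstdio>

struct Bitset36 {
    unsigned long long data : 36;
};

int main() {
    std::printf("%zu\n", sizeof(Bitset36));  // typically prints 8 on x86-64
}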
I'm not an expert, but this is what I would "try": find the size of the smallest type your compiler supports (it should be char). You can check with sizeof; you should get 1, meaning 1 byte, so 8 bits.
So if you wanted a 24-bit type, you would need 3 chars. For 36 you would need a 5-char array, with 4 bits of wasted padding at the end. This can easily be accounted for.
i.e.
char typeSize[3] = {0}; // should hold 24 bits
Now make a bit mask to access each position of typeSize.
const unsigned char one = 0b0000'0001;
const unsigned char two = 0b0000'0010;
const unsigned char three = 0b0000'0100;
const unsigned char four = 0b0000'1000;
const unsigned char five = 0b0001'0000;
const unsigned char six = 0b0010'0000;
const unsigned char seven = 0b0100'0000;
const unsigned char eight = 0b1000'0000;
Now you can use the bit-wise or to set the values to 1 where needed..
typeSize[1] |= four;
typeSize[0] |= (four | five);
To turn off bits, use the & operator with an inverted mask..
typeSize[0] &= ~four;
typeSize[2] &= ~(four | five);
You can read the position of each bit with the & operator.
typeSize[0] & four
Bear in mind, I don't have a compiler handy to try this out so hopefully this is a useful approach to your problem.
Good luck ;-)
You can use an array of unsigned long int and store and retrieve the needed bit chains with bitwise operations. This approach avoids space overhead.
Simplified example for an unsigned byte array B[] and 12-bit variables V (represented as ushort):
Set V[0]:
B[0] = V & 0xFF;        // low byte
B[1] = B[1] & 0xF0;     // clear low nibble
B[1] = B[1] | (V >> 8); // fill the low nibble of the second byte with the highest nibble of V
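Extending that idea to the asker's 36-bit elements, here is a rough sketch (my illustration, untuned): element i lives at bit offset 36*i in a flat uint64_t array, so a value can straddle two words:

#include <cstdint>
#include <vector>

struct Packed36 {
    std::vector<uint64_t> words;
    explicit Packed36(size_t count) : words((count * 36 + 63) / 64, 0) {}

    uint64_t get(size_t i) const {
        const uint64_t mask = (uint64_t(1) << 36) - 1;
        size_t bit = i * 36, idx = bit / 64, off = bit % 64;
        uint64_t v = words[idx] >> off;
        if (off > 64 - 36)                      // the value straddles into the next word
            v |= words[idx + 1] << (64 - off);
        return v & mask;
    }

    void set(size_t i, uint64_t v) {
        const uint64_t mask = (uint64_t(1) << 36) - 1;
        v &= mask;
        size_t bit = i * 36, idx = bit / 64, off = bit % 64;
        words[idx] = (words[idx] & ~(mask << off)) | (v << off);
        if (off > 64 - 36) {
            size_t hi = 36 - (64 - off);        // number of bits spilling into the next word
            words[idx + 1] = (words[idx + 1] & ~((uint64_t(1) << hi) - 1)) | (v >> (64 - off));
        }
    }
};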

Get bits from byte

I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
    return (bitResult >> (iStartPos + 1 - iNumOfBites)) & ~(~0 << iNumOfBites);
}
The function returns a group of bits from a byte.
i.e. if bitResult = 102 (01100110)2, iStartPos = 5, iNumOfBites = 3
Output: 4 (100)2
For iStartPos = 7, iNumOfBites = 4
Output: 6 (0110)2
I'm looking for a better / friendlier way to do that, e.g. with bitset or something like that. Any suggestions?
(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky in case len is the full width of the type, though (the corner case of extracting all the bits). Shifts by the full width of the type invoke undefined behaviour; in practice, when the compiler can't see the count at compile time, you get either zero or the value unchanged (x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts). With 32-bit ints:
If 1 << 32 produces 1, then 1 - 1 = 0, so the mask will be zero.
If ~0 << 32 produces ~0 rather than 0, the mask will also be zero.
Remember that 1 << len is undefined behaviour for len too large: unlike writing the mask out as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
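One way around the corner case (my sketch, not from the answer): split the shift in two so the count never reaches the width of the type:

#include <cstdint>

// Mask with `len` low bits set; safe for len == 0 and len == 32 because each
// individual shift count stays below the width of uint32_t.
uint32_t lowMask(int len) {
    return len == 0 ? 0u : (((1u << (len - 1)) << 1) - 1u);
}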
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites + 1], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
    return (input >> (msb - len + 1)) & ~(~0u << len);  // ~0u: left-shifting a negative signed value is undefined
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
There's also a machine instruction, bextr (introduced with Intel Haswell and AMD Piledriver), that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB + len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the form of bextr that takes its control operand in a register, which would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pextr for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range if start and len are both compile-time constants.
Only AMD implements a version of bextr that can have the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic (even with -march=bdver2 (i.e. bulldozer v2 aka piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr

// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__  // probably also need to check for __GNUC__
    return __builtin_ia32_bextr_u32(input, (len << 8) | (msb - len + 1));
#else
    return (input >> (msb - len + 1)) & ~(~0u << len);
#endif
}
It would take an extra instruction (a movzx) to implement an (msb-len+1) & 0xff safety check that keeps the start value from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either; -march=native on Godbolt is Haswell, and thus includes BMI2 as well).
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.
I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
 * Arguments:   const unsigned bitResult   byte containing the bit field to extract
 *              const int iStartPos        zero-based offset from the least significant bit
 *              const int iNumOfBites      number of bits to the right of the starting position
 *
 * Description: Extract a bit field beginning at the specified position for the specified
 *              number of bits, returning the extracted bit field right-justified.
 */
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
    // masks to remove any leading bits that need to disappear.
    // we change the starting position to be one-based, so the first element is unused.
    const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
    int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
    int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
    unsigned retVal = (bitResult & bitMasks[iStart]);
    if (iStart > iNum) {
        retVal >>= (iStart - iNum);
    }
    return retVal;
}
#pragma pack(push, 1)
struct Bit
{
    union
    {
        uint8_t _value;
        struct {
            uint8_t _bit0 : 1;  // each flag is one bit wide; a named bit field may not have zero width
            uint8_t _bit1 : 1;
            uint8_t _bit2 : 1;
            uint8_t _bit3 : 1;
            uint8_t _bit4 : 1;
            uint8_t _bit5 : 1;
            uint8_t _bit6 : 1;
            uint8_t _bit7 : 1;
        };
    };
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
    union
    {
        uint32_t _value;
        bit bytes[1]; // 1 for a single byte
    };
};
With a struct and union you can set the struct B _value to your result, then access bytes[0]._bit0 through bytes[0]._bit7 for each 0 or 1, and vice versa: set each bit, and the result will be in _value. (Note that anonymous structs inside unions are a widely supported compiler extension in C++, not standard, and bit-field layout is implementation-defined, so this is not portable.)
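A short usage sketch of the structs above (my addition; it assumes the common layout where _bit0 is the least significant bit):

#include <cstdint>
#include <cstdio>

int main()
{
    B b;
    b._value = 0x66;                        // (01100110)2
    std::printf("%d\n", b.bytes[0]._bit1);  // prints 1: bit 1 of 0x66 is set
    b.bytes[0]._bit0 = 1;                   // writing a bit updates _value
    std::printf("0x%x\n", b._value);        // prints 0x67
}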

Convert a 64bit integer to an array of 7bit-characters

Say I have a function vector<unsigned char> byteVector(long long UID), returning a byte representation of the UID, a 64-bit integer, as a vector. This vector is later used to write the data to a file.
Now, because I decided I want to read that file with Python, I have to comply with the UTF-8 standard, which means I can only use the lower 7 bits of each char: if the most significant bit is 1, I can't decode it to a string anymore, since only ASCII characters are supported there. Also, I'll have to pass those strings to other processes via a command-line interface, which likewise only supports the ASCII character set.
Before that problem arose, my approach on splitting the 64bit integer up into 8 separate bytes was the following, which worked really great:
vector<unsigned char> outputVector = vector<unsigned char>();
unsigned char *uidBytes = (unsigned char *)&UID;
for (int i = 0; i < 8; i++) {
    outputVector.push_back(uidBytes[i]);
}
Of course that doesn't work anymore, as the constraint "the high bit may not be 1" limits the maximum value of each unsigned char to 127.
My easiest option now would of course be to replace the one push_back call with this:
outputVector.push_back(uidBytes[i] / 128);
outputVector.push_back(uidBytes[i] % 128);
But this seems kind of wasteful, as the first of each unsigned char pair can only be 0 or 1 and I would be wasting some space (6 bytes) I could otherwise use.
As I need to save 64 bits and can use 7 bits per byte, I'll need ceil(64/7) = 10 bytes.
It isn't really much (none of the files I write has ever even reached the 1kB mark), but I was using 8 bytes before and it seems a bit wasteful to use 16 now when ten (not 9, I'm sorry) would suffice. So:
How do I convert a 64bit integer to a vector of ten 7-bit integers?
This is probably too much optimization, but there could be some very cool solution for this problem (probably using shift operators) and I would be really interested in seeing it.
You can use bit shifts to take 7-bit pieces of the 64-bit integer. However, you need ten 7-bit integers, nine is not enough: 9 * 7 = 63, one bit short.
std::uint64_t uid = 42; // Your 64-bit input here.
std::vector<std::uint8_t> outputVector;
for (int i = 0; i < 10; i++)
{
    outputVector.push_back((uid >> (i * 7)) & 0x7f);
}
In every iteration, we shift the input bits by a multiple of 7, and mask out a 7-bit part. The most significant bit of the 8-bit numbers will be zero. Note that the numbers in the vector are “reversed”: the least significant bits have the lowest index. This is irrelevant though, if you decode the parts in the correct way. Decoding can be done as follows:
std::uint64_t decoded = 0;
for (int i = 0; i < 10; i++)
{
    decoded |= static_cast<std::uint64_t>(outputVector[i]) << (i * 7);
}
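A quick round-trip check (my addition) tying the two loops together:

#include <cassert>
#include <cstdint>
#include <vector>

int main()
{
    const std::uint64_t uid = 0x123456789abcdef0ULL;

    std::vector<std::uint8_t> v;
    for (int i = 0; i < 10; i++)
        v.push_back((uid >> (i * 7)) & 0x7f);  // encode: ten 7-bit pieces

    std::uint64_t decoded = 0;
    for (int i = 0; i < 10; i++)
        decoded |= static_cast<std::uint64_t>(v[i]) << (i * 7);  // decode

    assert(decoded == uid);
}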
Please note that it seems like a bad idea to interpret the resulting vector as UTF-8 encoded text: the sequence can still contain control characters and '\0'. If you want to encode your 64-bit integer in printable characters, take a look at base64; in that case, you will need one more character (eleven in total) to encode 64 bits.
I suggest using assembly language.
Many assembly languages have instructions for shifting a bit into a "spare" carry bit and shifting the carry bit into a register. The C language has no convenient or efficient way to express this.
The algorithm, repeated for each of the ten output characters:
for i = 0; i < 7; ++i
{
    right shift the 64-bit word into carry.
    right shift carry into the character.
}
You should also look into using std::bitset.