A bit column shift - C++

How can I shift a column in an 8x8 bit area? For example, I have this 64-bit unsigned integer:
#include <boost/cstdint.hpp>
int main()
{
/** In binary:
*
* 10000000
* 10000000
* 10000000
* 10000000
* 00000010
* 00000010
* 00000010
* 00000010
*/
boost::uint64_t b = 0x8080808002020202;
}
Now, I want to shift the first column (the leftmost one) down, let's say, four times, after which it becomes this:
/** In binary:
*
* 00000000
* 00000000
* 00000000
* 00000000
* 10000010
* 10000010
* 10000010
* 10000010
*/
b == 0x82828282;
Can this be done relatively fast with only bit-wise operators, or what?

My best guess is this:
((b & 0x8080808080808080) >> 4 * 8) | (b & ~0x8080808080808080)
The idea is to isolate the column bits and shift only them.

Can this be done relatively fast with only bit-wise operators, or what?
Yes.
How you do it will depend on how "generic" you want to make the solution. Always first column? Always shift by 4?
Here's an idea:
The first 4 bytes represent the top 4 rows. Exploit that, loop over the top 4.
Mask out the first-column bit of each of those bytes using 0x80, to see if the bit is set.
Shift that bit down by 4 rows (>> 4*8); of course it'll need to be in a uint64 to do that.
Bitwise-OR (|) it against the new byte.
You can probably do better, by avoiding looping and writing more code.
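A rough sketch of that loop (my own interpretation of the steps above, with a made-up function name; it assumes the top row lives in the most significant byte, as in the question's example):
#include <boost/cstdint.hpp>
boost::uint64_t shift_first_column_down4(boost::uint64_t b)
{
    boost::uint64_t result = b & ~0x8080808000000000ull;         // clear the column bits of the top 4 rows
    for (int row = 0; row < 4; ++row) {
        boost::uint64_t bit = b & (0x80ull << (8 * (7 - row)));  // isolate this row's column bit
        result |= bit >> (4 * 8);                                // move it down 4 rows and OR it in
    }
    return result;                                               // 0x82828282 for the question's value
}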

There might be a SIMD instruction for this. You'd have to turn on those instructions in your VC++ settings, and of course they won't work on architectures other than AMD/Intel processors.

In this case, you want to split the value into two pieces, the first column and the other columns. Shift the first column by the appropriate amount, then combine them back together.
b = ((b & 0x8080808080808080) >> (8*4)) | (b & 0x7f7f7f7f7f7f7f7f);

Complete guess since I don't have a compiler nor Boost libs available:
Given b, col (counting 1 to 8 from right), and shift (distance of shift)
In your example, col would be 8 and shift would be 4.
boost::uint64_t flags = 0x0101010101010101;
boost::uint64_t mask = flags << (col -1);
boost::uint64_t eraser = ~mask;
boost::uint64_t data = b & mask;
data = data >> (8*shift);
b = (b & eraser) | data;
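Putting that together as a function (an untested sketch; the name is mine, using boost::uint64_t as in the question):
#include <boost/cstdint.hpp>
// col counts 1..8 from the right; shift is the number of rows to move the column down
// (towards the low bytes, matching the layout in the question).
boost::uint64_t shift_column(boost::uint64_t b, int col, int shift)
{
    boost::uint64_t flags = 0x0101010101010101ull;
    boost::uint64_t mask  = flags << (col - 1);        // isolate the column
    boost::uint64_t data  = (b & mask) >> (8 * shift);
    return (b & ~mask) | data;                         // erase the column, then OR in the shifted bits
}
For the question's example, shift_column(0x8080808002020202, 8, 4) yields 0x82828282.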

Related

Counting the one bits in a byte with O(1) time complexity - C++ code

I've searched for an algorithm that counts the number of one bits in a byte with O(1) time complexity,
and this is what I found on Google:
// C++ implementation of the approach
#include <bits/stdc++.h>
using namespace std;
int BitsSetTable256[256];
// Function to initialise the lookup table
void initialize()
{
// To initially generate the
// table algorithmically
BitsSetTable256[0] = 0;
for (int i = 0; i < 256; i++)
{
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
}
}
// Function to return the count
// of set bits in n
int countSetBits(int n)
{
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
}
// Driver code
int main()
{
// Initialise the lookup table
initialize();
int n = 9;
cout << countSetBits(n);
}
I understand why I need an array of size 256 (in other words, the size of the lookup table) for indexing from 0 to 255, which are all the decimal values a byte can represent.
But in the function initialize I didn't understand the expression inside the for loop:
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
Why am I doing that? I don't understand the purpose of this line of code inside the for loop.
In addition, the function countSetBits returns:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
I don't understand at all why I'm doing a bitwise AND with 0xff and why I'm doing the right shifts.
Could anyone please explain the concept to me? I also didn't understand why, in countSetBits, we don't AND BitsSetTable256[n >> 24] with 0xff.
I understand why I need the lookup table of size 2^8, but I don't understand the other lines of code I mentioned above; could anyone please explain them in simple words? And what is the purpose of counting the number of ones in a byte?
Thanks a lot, guys!
Concerning the first part of question:
// Function to initialise the lookup table
void initialize()
{
// To initially generate the
// table algorithmically
BitsSetTable256[0] = 0;
for (int i = 0; i < 256; i++)
{
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
}
}
This is a neat kind of recursion. (Please, note I don't mean "recursive function" but recursion in a more mathematical sense.)
The seed is BitsSetTable256[0] = 0;
Then every element is initialized using the (already existing) result for i / 2 and adds 1 or 0 for this. Thereby,
1 is added if the last bit of index i is 1
0 is added if the last bit of index i is 0.
To get the value of last bit of i, i & 1 is the usual C/C++ bit mask trick.
Why is the result of BitsSetTable256[i / 2] a value to build upon?
The result of BitsSetTable256[i / 2] is the number of all bits of i the last one excluded.
Please, note that i / 2 and i >> 1 (the value (or bits) shifted to right by 1 whereby the least/last bit is dropped) are equivalent expressions (for positive numbers in the resp. range – edge cases excluded).
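For illustration, here are a few entries as the loop computes them (my own worked example):
BitsSetTable256[1]  = (1 & 1)  + BitsSetTable256[0] = 1 + 0 = 1   (binary 1 has one set bit)
BitsSetTable256[2]  = (2 & 1)  + BitsSetTable256[1] = 0 + 1 = 1   (binary 10 has one set bit)
BitsSetTable256[3]  = (3 & 1)  + BitsSetTable256[1] = 1 + 1 = 2   (binary 11 has two set bits)
BitsSetTable256[12] = (12 & 1) + BitsSetTable256[6] = 0 + 2 = 2   (binary 1100 has two set bits)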
Concerning the other part of the question:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
n & 0xff masks out the upper bits isolating the lower 8 bits.
(n >> 8) & 0xff shifts the value of n 8 bits to right (whereby the 8 least bits are dropped) and then again masks out the upper bits isolating the lower 8 bits.
(n >> 16) & 0xff shifts the value of n 16 bits to right (whereby the 16 least bits are dropped) and then again masks out the upper bits isolating the lower 8 bits.
n >> 24 shifts the value of n 24 bits to the right (whereby the 24 least bits are dropped), which should effectively make the upper 8 bits the lower 8 bits.
Assuming that int and unsigned usually have 32 bits on today's common platforms, this covers all bits of n.
Please, note that the right shift of a negative value is implementation-defined.
(I recalled Bitwise shift operators to be sure.)
So, a right-shift of a negative value may fill all upper bits with 1s.
That can break BitsSetTable256[n >> 24]: the index n >> 24 may then fall outside the range 0 to 255, making BitsSetTable256[n >> 24] an out-of-bounds access.
The better solution would've been:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[(n >> 24) & 0xff]);
BitsSetTable256[0] = 0;
...
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
The above code seeds the look-up table, where each entry contains the number of ones of the number used as its index. It works as follows:
(i & 1) gives 1 for odd numbers, otherwise 0.
An even number has as many binary 1s as that number divided by 2.
An odd number has one more binary 1 than that number divided by 2.
Examples:
if i==8 (1000b) then (i & 1) + BitsSetTable256[i / 2] ->
0 + BitsSetTable256[8 / 2] = 0 + BitsSetTable256[4] (0100b) = 0 + 1.
if i==7 (0111b) then 1 + BitsSetTable256[7 / 2] = 1 + BitsSetTable256[3] (0011b) = 1 + 2.
If you want some formal mathematical proof why this is so, then I'm not the right person to ask, I'd poke one of the math sites for that.
As for the shift part, it's just the normal way of splitting up a 32-bit value into 4x8 bits, portably and without caring about endianness (any other method to do that is highly questionable). If we un-sloppify the code, we get this:
BitsSetTable256[(n >> 0) & 0xFFu] +
BitsSetTable256[(n >> 8) & 0xFFu] +
BitsSetTable256[(n >> 16) & 0xFFu] +
BitsSetTable256[(n >> 24) & 0xFFu] ;
Each byte is shifted into the LS byte position, then masked out with a & 0xFFu byte mask.
Using bit shifts on int is however a code smell and potentially buggy. To avoid poorly-defined behavior, you need to change the function to this:
#include <stdint.h>
uint32_t countSetBits (uint32_t n);
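A full definition along those lines might look like this (my sketch; the lookup table and its initialization stay exactly as shown above):
uint32_t countSetBits (uint32_t n)
{
    return (uint32_t)(BitsSetTable256[n & 0xFFu] +
                      BitsSetTable256[(n >> 8)  & 0xFFu] +
                      BitsSetTable256[(n >> 16) & 0xFFu] +
                      BitsSetTable256[(n >> 24) & 0xFFu]);
}
With an unsigned argument the right shifts are well defined and every index is guaranteed to be in the range 0 to 255.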
The code in countSetBits takes an int as an argument; apparently 32 bits are assumed. The implementation extracts four single bytes from n by shifting and masking; for these four separate bytes, the lookup table is used and the per-byte bit counts are added to yield the result.
The initialization of the lookup table is a bit more tricky and can be seen as a form of dynamic programming. The entries are filled in increasing index of the argument. The first expression masks out the least significant bit and counts it; the second expression halves the argument (which could be also done by shifting). The resulting argument is smaller; it is then correctly assumed that the necessary value for the smaller argument is already available in the lookup table.
For the access to the lookup table, consider the following example:
input value (contains 5 ones):
01010000 00000010 00000100 00010000
input value, shifting is not necessary
masked with 0xff (11111111)
00000000 00000000 00000000 00010000 (contains 1 one)
input value shifted by 8
00000000 01010000 00000010 00000100
and masked with 0xff (11111111)
00000000 00000000 00000000 00000100 (contains 1 one)
input value shifted by 16
00000000 00000000 01010000 00000010
and masked with 0xff (11111111)
00000000 00000000 00000000 00000010 (contains 1 one)
input value shifted by 24,
masking is not necessary
00000000 00000000 00000000 01010000 (contains 2 ones)
The extracted values have only the lowermost 8 bits set, which means that the corresponding entries are available in the lookup table. The entries from the lookup table are added. The underlying idea is that the number of ones in the argument can be calculated byte-wise (in fact, any partition into bitstrings would be suitable).

How would you transpose a binary matrix?

I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.
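(For illustration only, not part of the original question: one output bit of that product could be computed like this, using C++20's std::popcount for the parity; the helper name is mine.)
#include <bit>
#include <cstdint>
inline uint8_t row_dot(uint8_t row, uint8_t v)
{
    // AND the row with the vector, then take the parity of the set bits.
    return static_cast<uint8_t>(std::popcount(static_cast<unsigned>(row & v)) & 1u);
}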
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing a uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
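A minimal sketch of that loop for a single 16×8 block (my own code, not from the original post; the output ordering simply follows the loop convention):
#include <stdint.h>
#include <emmintrin.h>
void transpose16x8(const uint8_t in[16], uint16_t out[8])
{
    __m128i x = _mm_loadu_si128((const __m128i*)in);    // pack 16 row bytes
    for (int bit = 7; bit >= 0; --bit) {
        out[bit] = (uint16_t)_mm_movemask_epi8(x);      // MSB of each byte -> one 16-bit output row
        x = _mm_slli_epi64(x, 1);                       // bring the next bit into the MSB position
    }
}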
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest doing it by treating an 8x8 block as a uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have too much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number with which you can multiply, and then the MSB of each byte gets next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
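A sketch of one step of this trick (my own code, not the original implementation; it relies on a widening 32x32 -> 64 multiply, which maps to a single UMULL on Cortex-M):
#include <stdint.h>
// Gather bit (7 - n) of each of the 4 packed bytes into 4 adjacent output bits.
// Valid while the shifted magic constant still fits in 32 bits; the remaining
// bit position needs the extra input shift described above.
static inline uint8_t gather_column(uint32_t four_rows, unsigned n)
{
    uint32_t mask  = 0x80808080u >> n;                     // the Nth bit of every byte (mask shifted right by n)
    uint32_t magic = 0x02040810u << n;                     // magic multiplier, shifted left to match
    uint64_t prod  = (uint64_t)(four_rows & mask) * magic;
    return (uint8_t)((prod >> 32) & 0x0Fu);                // 4 LSBs of the upper 32 bits
}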
My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it will be the same as multiplying the matrix on the left by the vector (and then transposing). This is easy: just some xor operations of your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix
transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
return
(word & 0x0100000000000000) >> 49 // top right corner
| (word & 0x0201000000000000) >> 42
| ...
| (word & 0x4020100804020100) >> 7 // just above diagonal
| (word & 0x8040201008040201) // leading diagonal
| (word & 0x0080402010080402) << 7 // just below diagonal
| ...
| (word & 0x0000000000008040) << 42
| (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is that with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t, so we can use this to our advantage, especially when using a 64-bit platform.
I have worked out a simple example using a lookup table which you can find below and run using: http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>
using std::cout;
using std::endl;
using std::bitset;
/* Static lookup table */
static uint64_t lut[256];
/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
for(int i=0; i < N; ++i){
cout << bitset<8>(arr[i]) << endl;
}
}
/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
assert(N <= 8);
uint64_t value = 0;
for(int i=0; i < N; ++i){
value = (value << 1) + lut[matrix[i]];
}
/* Ensure safe copy to prevent misalignment issues */
/* Can be removed if input array can be treated as uint64_t directly */
for(int i=0; i < 8; ++i){
transposed[i] = (value >> (i * 8)) & 0xFF;
}
}
/* Calculate lookup table */
void calculate_lut(void){
/* For all byte values */
for(uint64_t i = 0; i < 256; ++i){
auto b = std::bitset<8>(i);
auto v = std::bitset<64>(0);
/* For all bits in current byte */
for(int bit=0; bit < 8; ++bit){
if(b.test(bit)){
v.set((7 - bit) * 8);
}
}
lut[i] = v.to_ullong();
}
}
int main()
{
calculate_lut();
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
uint8_t transposed[8];
transpose_bitmatrix(matrix, transposed);
print_arr(transposed);
return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because now we can use a uint64_t to write the 8-byte array, but we can also use it to add, xor, etc., because we can perform basic arithmetic operations on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide; to optimize processing we first generate a 256-entry lookup table (one for each byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100, which stored little-endian is the byte sequence (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under water it represents a uint8_t[8] array. We simply shift the current value (starting with 0), look up our next byte and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so it only sets the least significant bit of each byte. Shifting the uint64_t shifts each byte; as long as we make sure we do not do this more than 8 times, we can do operations on each byte individually.
Issues
As someone noted in the comments: Transpose(Transpose(M)) != M, so if you need this you need some additional work.
Performance can be improved by directly mapping uint64_ts instead of uint8_t[8] arrays, since that omits the "safe copy" used to prevent alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M3/M4 parts have a bit-banding region which can be used for exactly what you need: it expands bits into 32-bit fields, and this region can be used to perform atomic bit operations.
If you put your array in a bit-banded region it will have an 'exploded' mirror in the bit-band region where you can just use move operations on the bits themselves. If you make a loop, the compiler will surely be able to unroll and optimize it to just move operations.
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained a speedup of about 10X
compared to naive coding, on x86.
Here's what I posted on github (mischasan/sse2/ssebmx.src):
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>
#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]
void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
// Do the main body in [16 x 8] blocks:
for (rr = 0; rr <= nrows - 16; rr += 16)
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
if (rr == nrows) return;
// The remainder is a row of [8 x 16]* [8 x 8]?
// Do the [8 x 16] blocks:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 8; i--;)
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
tmp.b[i + 8] = h >> 8;
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
OUT(rr, cc + i + 8) = h >> 8;
}
if (cc == ncols) return;
// Do the remaining [8 x 8] block:
for (i = 8; i--;)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in Arm Neon can be utilised to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
auto b = vget_low_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
auto b = vget_high_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, making heavy use of ext q0,q0,8 for vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers and also simdify for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H          r B|t D|v F|x H    | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

How to get the bit length of an integer in C++? [duplicate]

This question already has answers here:
Minimum number of bits to represent a given `int`
(9 answers)
Closed 4 months ago.
This question is not a duplicate of Count the number of set bits in a 32-bit integer. See comment by Daniel S. below.
--
Let's say there is a variable int x;. Its size is 4 bytes, i.e. 32 bits.
Then I assign a value to this variable, x = 4567 (in binary 10001 11010111), so in memory it looks like this:
00000000 00000000 00010001 11010111
Is there a way to get the length of the bits which matter? In my example, the bit length is 13 (the 13 bits starting from the leading 1).
If I use sizeof(x) it returns 4, i.e. 4 bytes, which is the size of the whole int. How do I get the minimum number of bits required to represent the integer without the leading 0s?
unsigned bits, var = (x < 0) ? -x : x;
for(bits = 0; var != 0; ++bits) var >>= 1;
This should do it for you.
Warning: math ahead. If you are squeamish, skip ahead to the TL;DR.
What you are really looking for is the highest bit that is set. Let's write out what the binary number 10001 11010111 actually means:
x = 1 * 2^(12) + 0 * 2^(11) + 0 * 2^(10) + ... + 1 * 2^1 + 1 * 2^0
where * denotes multiplication and ^ is exponentiation.
You can write this as
2^12 * (1 + a)
where 0 < a < 1 (to be precise, a = 0/2 + 0/2^2 + ... + 1/2^11 + 1/2^12).
If you take the logarithm (base 2), let's denote it by log2, of this number you get
log2(2^12 * (1 + a)) = log2(2^12) + log2(1 + a) = 12 + b.
Since a < 1 we can conclude that 1 + a < 2 and therefore b < 1.
In other words, if you take the log2(x) and round it down you will get the most significant power of 2 (in this case, 12). Since the powers start counting at 0, the number of bits is one more than this power, namely 13. So:
TL;DR:
The minimum number of bits needed to represent the number x is given by
numberOfBits = floor(log2(x)) + 1
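As a quick illustration of the formula (my own sketch; fine for small positive values, but the integer-only approaches below are safer in general, since floating-point log2 can lose precision for large inputs):
#include <cmath>
int numberOfBits(unsigned x)
{
    return static_cast<int>(std::floor(std::log2(x))) + 1;   // undefined for x == 0
}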
You're looking for the most significant bit that's set in the number. Let's ignore negative numbers for a second. How can we find it? Well, let's see how many bits we need to set to zero before the whole number is zero.
00000000 00000000 00010001 11010111
00000000 00000000 00010001 11010110
^
00000000 00000000 00010001 11010100
^
00000000 00000000 00010001 11010000
^
00000000 00000000 00010001 11010000
^
00000000 00000000 00010001 11000000
^
00000000 00000000 00010001 11000000
^
00000000 00000000 00010001 10000000
^
...
^
00000000 00000000 00010000 00000000
^
00000000 00000000 00000000 00000000
^
Done! After 13 bits, we've cleared them all. Now how do we do this? Well, the expression 1 << pos is the 1 bit shifted over pos positions. So we can check if (x & (1<<pos)) and, if true, remove it: x -= (1<<pos). We can also do the check and clear in one operation: x &= ~(1<<pos). ~ gets us the complement: all ones with the pos bit set to zero instead of the other way around. x &= y copies the zero bits of y into x.
Now how do we deal with signed numbers? The easiest is to just ignore it: unsigned xu = x;
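Here is a small sketch of that clearing loop (my own code, taking the value as unsigned as suggested above):
unsigned bit_length(unsigned xu)
{
    unsigned pos = 0;
    while (xu != 0) {
        xu &= ~(1u << pos);   // clear the bit at position pos
        ++pos;
    }
    return pos;               // 13 for the example value 4567
}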
Many processors provide an instruction for calculating the number of leading zero bits directly (e.g. x86 has lzcnt / bsr and ARM has clz). Usually C++ compilers provide an intrinsic for accessing one of these instructions. The number of leading zeros can then be used to calculate the bit length.
In GCC, the intrinsic is called __builtin_clz. It counts the number of leading zeros for a 32 bit integer.
However, there is one caveat about __builtin_clz. When the input is 0, the result is undefined. Therefore we need to take care of this special case. This is done in the following function with (x == 0) ? 32 : ..., which gives the result 32 when x is 0:
uint32_t count_of_leading_0_bits(const uint32_t &x) {
return (x == 0) ? 32 : __builtin_clz(x);
}
The bit length can then be calculated from the number of leading zeros:
uint32_t bitlen(const uint32_t &x) {
return 32 - count_of_leading_0_bits(x);
}
Note that other C++ compilers have different intrinsics for counting the number of leading zero bits, but you can find them quickly with a search on the internet. See "How to use MSVC intrinsics to get the equivalent of this GCC code?" for an MSVC equivalent.
The portable modern way since C++20 should probably use std::countl_zero, like
#include <bit>
int bit_length(unsigned x)
{
return (8*sizeof x) - std::countl_zero(x);
}
Both gcc and clang emit a single bsr instruction on x86 for this code (with a branch on zero), so it should be pretty much optimal.
Note that std::countl_zero only accepts unsigned arguments though, so deciding how to handle your original int parameter is left as an exercise for the reader.

Unsigned integer into little endian form [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
convert big endian to little endian in C [without using provided func]
I'm having trouble with this one part: I want to take a 32-bit number and shift its bytes (1 byte = 8 bits) from big-endian to little-endian form. For example:
Lets say I have the number 1.
In 32 bits this is what it would look like:
1st byte 2nd byte 3rd byte 4th byte
00000000 00000000 00000000 00000001
I want it so that it looks like this:
4th byte 3rd byte 2nd byte 1st byte
00000001 00000000 00000000 00000000
so that the byte with the least significant value appears first. I was thinking you could use a for loop, but I'm not exactly sure how to shift bits/bytes in C++. For example, if a user entered 1 and I had to shift its bits like the above example, I'm not sure how I would convert 1 into bits, then shift. Could anyone point me in the right direction? Thanks!
<< and >> are the bitwise shift operators in C and most other C-style languages.
One way to do what you want is:
int value = 1;
unsigned int x = (unsigned int)value;
int valueShifted =
( x << 24) | // Move 1st byte to 4th
((x << 8) & 0x00ff0000) | // Move 2nd byte to 3rd
((x >> 8) & 0x0000ff00) | // Move 3rd byte to 2nd
( x >> 24); // Move 4th byte to 1st
uint32_t n = 0x00000001;
std::reverse( (char*)&n, (char*)(&n + 1) );
assert( n == 0x01000000 );
Shifting is done with the << and >> operators. Together with the bit-wise AND (&) and OR (|) operators you can do what you want:
int value = 1;
int shifted = value << 24 | (value & 0x0000ff00) << 8 | (value & 0x00ff0000) >> 8 | (value & 0xff000000) >> 24;

How to read/write arbitrary bits in C/C++

Assuming I have a byte b with the binary value of 11111111
How do I for example read a 3 bit integer value starting at the second bit or write a four bit integer value starting at the fifth bit?
Some 2+ years after I asked this question, I'd like to explain it the way I'd have wanted it explained back when I was still a complete newb, and in the way that would be most beneficial to people who want to understand the process.
First of all, forget the "11111111" example value, which is not really all that suited for the visual explanation of the process. So let the initial value be 10111011 (187 decimal) which will be a little more illustrative of the process.
1 - how to read a 3 bit value starting from the second bit:
___ <- those 3 bits
10111011
The value is 101, or 5 in decimal, there are 2 possible ways to get it:
mask and shift
In this approach, the needed bits are first masked with the value 00001110 (14 decimal) after which it is shifted in place:
___
10111011 AND
00001110 =
00001010 >> 1 =
___
00000101
The expression for this would be: (value & 14) >> 1
shift and mask
This approach is similar, but the order of operations is reversed, meaning the original value is shifted and then masked with 00000111 (7) to only leave the last 3 bits:
___
10111011 >> 1
___
01011101 AND
00000111 =
00000101
The expression for this would be: (value >> 1) & 7
Both approaches involve the same amount of complexity, and therefore will not differ in performance.
2 - how to write a 3 bit value starting from the second bit:
In this case, the initial value is known, and when this is the case in code, you may be able to come up with a way to set the known value to another known value which uses fewer operations, but in reality this is rarely the case; most of the time the code will know neither the initial value nor the one which is to be written.
This means that in order for the new value to be successfully "spliced" into the byte, the target bits must be set to zero, after which the shifted value is "spliced" in place, which is the first step:
___
10111011 AND
11110001 (241) =
10110001 (masked original value)
The second step is to shift the value we want to write in the 3 bits, say we want to change that from 101 (5) to 110 (6)
___
00000110 << 1 =
___
00001100 (shifted "splice" value)
The third and final step is to splice the masked original value with the shifted "splice" value:
10110001 OR
00001100 =
___
10111101
The expression for the whole process would be: (value & 241) | (6 << 1)
Bonus - how to generate the read and write masks:
Naturally, using a binary to decimal converter is far from elegant, especially in the case of 32 and 64 bit containers - decimal values get crazy big. It is possible to easily generate the masks with expressions, which the compiler can efficiently resolve during compilation:
read mask for "mask and shift": ((1 << fieldLength) - 1) << (fieldIndex - 1), assuming that the index at the first bit is 1 (not zero)
read mask for "shift and mask": (1 << fieldLength) - 1 (index does not play a role here since it is always shifted to the first bit
write mask : just invert the "mask and shift" mask expression with the ~ operator
How does it work (with the 3bit field beginning at the second bit from the examples above)?
00000001 << 3
00001000 - 1
00000111 << 1
00001110 ~ (read mask)
11110001 (write mask)
The same examples apply to wider integers and arbitrary bit width and position of the fields, with the shift and mask values varying accordingly.
Also note that the examples assume an unsigned integer, which is what you want to use in order to use integers as a portable bit-field alternative (regular bit-fields are in no way guaranteed by the standard to be portable): both left and right shifts insert a padding 0, which is not the case when right-shifting a signed integer.
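To tie the read and write steps together, here is a minimal sketch using zero-based bit indices (the function names are mine):
#include <cstdint>
// Valid for 1 <= length <= 31 with 32-bit values.
uint32_t read_field(uint32_t value, unsigned index, unsigned length)
{
    return (value >> index) & ((1u << length) - 1u);          // shift and mask
}
uint32_t write_field(uint32_t value, unsigned index, unsigned length, uint32_t field)
{
    uint32_t mask = ((1u << length) - 1u) << index;           // read mask shifted into place
    return (value & ~mask) | ((field << index) & mask);       // clear the target bits, then splice
}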
Even easier:
Using this set of macros (but only in C++ since it relies on the generation of member functions):
#define GETMASK(index, size) ((((size_t)1 << (size)) - 1) << (index))
#define READFROM(data, index, size) (((data) & GETMASK((index), (size))) >> (index))
#define WRITETO(data, index, size, value) ((data) = (((data) & (~GETMASK((index), (size)))) | (((value) << (index)) & (GETMASK((index), (size))))))
#define FIELD(data, name, index, size) \
inline decltype(data) name() const { return READFROM(data, index, size); } \
inline void set_##name(decltype(data) value) { WRITETO(data, index, size, value); }
You could go for something as simple as:
struct A {
uint bitData;
FIELD(bitData, one, 0, 1)
FIELD(bitData, two, 1, 2)
};
And have the bit fields implemented as properties you can easily access:
A a;
a.set_two(3);
cout << a.two();
Replace decltype with gcc's typeof pre-C++11.
You need to shift and mask the value, so for example...
If you want to read the first two bits, you just need to mask them off like so:
int value = input & 0x3;
If you want to offset it you need to shift right N bits and then mask off the bits you want:
int value = (input >> 1) & 0x3;
To read three bits like you asked in your question.
int value = (input >> 1) & 0x7;
Just use these and feel free:
#define BitVal(data,y) ( (data>>y) & 1) /** Return Data.Y value **/
#define SetBit(data,y) data |= (1 << y) /** Set Data.Y to 1 **/
#define ClearBit(data,y) data &= ~(1 << y) /** Clear Data.Y to 0 **/
#define TogleBit(data,y) (data ^= (1 << y)) /** Toggle Data.Y value **/
#define Togle(data) (data = ~data) /** Toggle Data value **/
for example:
uint8_t number = 0x05; //0b00000101
uint8_t bit_2 = BitVal(number,2); // bit_2 = 1
uint8_t bit_1 = BitVal(number,1); // bit_1 = 0
SetBit(number,1); // number = 0x07 => 0b00000111
ClearBit(number,2); // number = 0x03 => 0b00000011
You have to do a shift and mask (AND) operation.
Let b be any byte and p be the index (>= 0) of the bit from which you want to take n bits (>= 1).
First you have to shift right b by p times:
x = b >> p;
Second you have to mask the result with n ones:
mask = (1 << n) - 1;
y = x & mask;
You can put everything in a macro:
#define TAKE_N_BITS_FROM(b, p, n) ((b) >> (p)) & ((1 << (n)) - 1)
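For example (a usage sketch with my own values):
uint8_t b = 0xBB;                    /* binary 10111011 */
int v = TAKE_N_BITS_FROM(b, 1, 3);   /* bits 1..3 -> binary 101 -> 5 */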
"How do I for example read a 3 bit integer value starting at the second bit?"
int number = // whatever;
uint8_t val; // uint8_t is the smallest data type capable of holding 3 bits
val = (number & (1 << 2 | 1 << 3 | 1 << 4)) >> 2;
(I assumed that "second bit" is bit #2, i. e. the third bit really.)
To read bytes use std::bitset
const int bits_in_byte = 8;
char myChar = 's';
cout << bitset<sizeof(myChar) * bits_in_byte>(myChar);
To write you need to use bit-wise operators such as & ^ | & << >>. make sure to learn what they do.
For example, to have 00100100 you need to set the first bit to 1 and shift it with the << >> operators 5 times. If you want to continue writing you just continue to set the first bit and shift it. It's very much like an old typewriter: you write, and shift the paper.
For 00100100: set the first bit to 1, shift 5 times, set the first bit to 1, and shift 2 times:
const int bits_in_byte = 8;
char myChar = 0;
myChar = myChar | (0x1 << 5 | 0x1 << 2);
cout << bitset<sizeof(myChar) * bits_in_byte>(myChar);
int x = 0xFF; //your number - 11111111
How do I for example read a 3 bit integer value starting at the second bit
int y = x & (0x7 << 2); // 0x7 is 111
// and you shift it 2 to the left
If you keep grabbing bits from your data, you might want to use a bitfield. You'll just have to set up a struct and load it with only ones and zeroes:
struct bitfield{
unsigned int bit : 1;
};
struct bitfield *bitstream;
then later on load it like this (replacing char with int or whatever data you are loading):
long int i;
int j, k;
unsigned char c, d;
bitstream = (struct bitfield*)malloc(sizeof(struct bitfield) * charstreamlength * sizeof(char) * 8);
for (i=0; i<charstreamlength; i++){
c=charstream[i];
for(j=0; j < sizeof(char)*8; j++){
d=c;
d=d>>(sizeof(char)*8-j-1);
d=d<<(sizeof(char)*8-1);
k=d;
if(k==0){
bitstream[sizeof(char)*8*i + j].bit=0;
}else{
bitstream[sizeof(char)*8*i + j].bit=1;
}
}
}
Then access elements:
bitstream[bitpointer].bit=...
or
...=bitstream[bitpointer].bit
All of this assumes you are working on x86/64, not ARM, since ARM can be big- or little-endian.