How would you transpose a binary matrix?

How would you transpose a binary matrix? - c++

I have binary matrices in C++ that I repesent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.

I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just just find the NEON equivalents, but the Cortex-M CPU, (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetics, I could not use the ideas in any answers that suggest to do it by treating a 8x8 block as an uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have too much memory so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetics and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a collegaue and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number with which you can multiply and then the MSB of each byte gets next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.

My suggestion is that, you don't do the transposition, rather you add one bit information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposd matrix with a vector, it will be the same as multiplying the matrix on the left by the vector (and then transpose). This is easy: just some xor operations of your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.

Here is the text of Jay Foad's email to me regarding fast Boolean matrix
transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
return
(word & 0x0100000000000000) >> 49 // top right corner
| (word & 0x0201000000000000) >> 42
| ...
| (word & 0x4020100804020100) >> 7 // just above diagonal
| (word & 0x8040201008040201) // leading diagonal
| (word & 0x0080402010080402) << 7 // just below diagonal
| ...
| (word & 0x0000000000008040) << 42
| (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)

My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t so we can use this to our advantage especially when using a 64-bit platform.
I have worked out a simple example using a lookup table which you can find below and run using: http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>
using std::cout;
using std::endl;
using std::bitset;
/* Static lookup table */
static uint64_t lut[256];
/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
for(int i=0; i < N; ++i){
cout << bitset<8>(arr[i]) << endl;
}
}
/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
assert(N <= 8);
uint64_t value = 0;
for(int i=0; i < N; ++i){
value = (value << 1) + lut[matrix[i]];
}
/* Ensure safe copy to prevent misalignment issues */
/* Can be removed if input array can be treated as uint64_t directly */
for(int i=0; i < 8; ++i){
transposed[i] = (value >> (i * 8)) & 0xFF;
}
}
/* Calculate lookup table */
void calculate_lut(void){
/* For all byte values */
for(uint64_t i = 0; i < 256; ++i){
auto b = std::bitset<8>(i);
auto v = std::bitset<64>(0);
/* For all bits in current byte */
for(int bit=0; bit < 8; ++bit){
if(b.test(bit)){
v.set((7 - bit) * 8);
}
}
lut[i] = v.to_ullong();
}
}
int main()
{
calculate_lut();
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
uint8_t transposed[8];
transpose_bitmatrix(matrix, transposed);
print_arr(transposed);
return 0;
}
How it works
your 3x8 matrix will be transposed to a 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits, your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because now we can use an uint64_t to write the 8 byte array, but we can also use it for to add, xor, etc. because we can perform basic arithmetic operations on a 64 bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide, to optimize processing we first generate 256 entry lookup table (for each byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0001000001010101 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t but keep in mind that under water it will represent a uint8_t[8] array. We simple shift the current value (start with 0), look up our first byte and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table will either be 1 or 0 so it will only set the least significant bit (of each byte). Shifting the uint64_t will shift each byte, as long as we make sure we do not do this more than 8 times! we can do operations on each byte individually.
Issues
As someone noted in the comments: Translate(Translate(M)) != M so if you need this you need some additional work.
Perfomance can be improved by directly mapping uint64_t's instead of uint8_t[8] arrays since it omits a "safe-copy" to prevent alignment issues.

I have added a new awnser instead of editing my original one to make this more visible (no comment rights unfortunatly).
In your own awnser you add an additional requirement not present in the first one: It has to work on ARM Cortex-M
I did come up with an alternative solution for ARM in my original awnser but omitted it as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M 3/4 have a bit banding region which can be used for exactly what you need, it expands bits into 32-bit fields, this region can be used to perform atomic bit operations.
If you put your array in a bitbanded region it will have an 'exploded' mirror in the bitband region where you can just use move operations on the bits itself. If you make a loop the compiler will surely be able to unroll and optimize to just move operations.
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.

This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition,there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained a factor about 10X
speedup compared to naive coding, on an X86.

Here's what I posted on gitub (mischasan/sse2/ssebmx.src)
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>
#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]
void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
// Do the main body in [16 x 8] blocks:
for (rr = 0; rr <= nrows - 16; rr += 16)
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
if (rr == nrows) return;
// The remainder is a row of [8 x 16]* [8 x 8]?
// Do the [8 x 16] blocks:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 8; i--;)
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
tmp.b[i + 8] = h >> 8;
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
OUT(rr, cc + i + 8) = h >> 8;
}
if (cc == ncols) return;
// Do the remaining [8 x 8] block:
for (i = 8; i--;)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.

Inspired by Roberts answer, polynomial multiplication in Arm Neon can be utilised to scatter the bits --
inline poly8x16_t mull_lo(poly8x16_t a) {
auto b = vget_low_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
auto b = vget_high_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, using heavily ext q0,q0,8 to vget_high_p8.
An iterative approach would possibly be not only faster, but also uses less registers and also simdifies for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H r B|t D|v F|h H | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

Related

The fastest way to swap the two lowest bits in an unsigned int in C++

Assume that I have:
unsigned int x = 883621;
which in binary is :
00000000000011010111101110100101
I need the fastest way to swap the two lowest bits:
00000000000011010111101110100110
Note: To clarify: If x is 7 (0b111), the output should be still 7.

If you have few bytes of memory to spare, I would start with a lookup table:
constexpr unsigned int table[]={0b00,0b10,0b01,0b11};
unsigned int func(unsigned int x){
auto y = (x & (~0b11)) |( table[x&0b11]);
return y;
}
Quickbench -O3 of all the answers so far.
Quickbench -Ofast of all the answers so far.
(Plus my ifelse naive idea.)
[Feel free to add yourself and edit my answer].
Please do correct me if you believe the benchmark is incorrect, I am not an expert in reading assembly. So hopefully volatile x prevented caching the result between loops.

I'll ignore the top bits for a second - there's a trick using multiplication. Multiplication is really a convolution operation, and you can use that to shuffle bits.
In particular, assume the two lower bits are AB. Multiply that by 0b0101, and you get ABAB. You'll see that the swapped bits BA are the middle bits.
Hence,
x = (x & ~3U) | ((((x&3)*5)>>1)&3)
[edit] The &3 is needed to strip the top A bit, but with std::uint_32_t you can use overflow to lose that bit for free - multiplication then gets you the result BAB0'0000'0000'0000'0000'0000'0000'0000'0000' :
x = (x & ~3U) | ((((x&3)*0xA0000000)>>30));

I would use
x = (x & ~0b11) | ((x & 0b10) >> 1) | ((x & 0b01) << 1);

Inspired by the table idea, but with the table as a simple constant instead of an array. We just need mask(00)==00, mask(01)==11, mask(10)=11, masK(11)==11.
constexpr unsigned int table = 0b00111100;
unsigned int func(unsigned int x) {
auto xormask = (table >> ((x&3) * 2)) &3;
x ^= xormask;
return x;
}
This also uses the xor-trick from dyungwang to avoid isolating the top bits.

Another idea, to avoid stripping the top bits. Assume x has the bits XXXXAB, then we want to x-or it with 0000(A^B)(A^B). Thus
auto t = x^(x>>1); // Last bit is now A^B
t &=1; // take just that bit
t *= 3; // Put in the last two positions
x ^= t; // Change A to B and B to A.

Just looking from a mathematical point of view, I would start with a rotate_left() function, which rotates a list of bits one place to the left (011 becomes 110, then 101, and then back 011), and use this as follows:
int func(int input){
return rotate_left(rotate_left((input / 4))) + rotate_left(input % 4);
}
Using this on the author's example 11010111101110100101:
input = 11010111101110100101;
input / 4 = 110101111011101001;
rotate_left(input / 4) = 1101011110111010010;
rotate_left(rotate_left(input / 4) = 11010111101110100100;
input % 4 = 01;
rotate_left(input % 4) = 10;
return 11010111101110100110;
There is also a shift() function, which can be used (twice!) for replacing the integer division.

Optimize blockwise bit operations: base-4 numbers

This should be a fun question, at least for me.
My intent is to manipulate base-4 numbers, encoded in a unsigned integer. Each two-bits block then represents a single base-4 digit, starting from the least significant bit:
01 00 11 = base4(301)
I'd like to optimize my code using SSE instructions, because I'm not sure how I scored here, maybe poorly.
The code starts from strings (and uses them to check the correctness), and implements:
convert string to binary
convert binary to string
reverse the number
Any hints are more than welcome!
uint32_t tobin(std::string s)
{
uint32_t v, bin = 0;
// Convert to binary
for (int i = 0; i < s.size(); i++)
{
switch (s[i])
{
case '0':
v = 0;
break;
case '3':
v = 3;
break;
case '1':
v = 1;
break;
case '2':
v = 2;
break;
default:
throw "UNKOWN!";
}
bin = bin | (v << (i << 1));
}
return bin;
}
std::string tostr(int size, const uint32_t v)
{
std::string b;
// Convert to binary
for (int i = 0; i < size; i++)
{
uint32_t shl = 0, shr = 0, q;
shl = (3 << (i << 1));
shr = i << 1;
q = v & shl;
q = q >> shr;
unsigned char c = static_cast<char>(q);
switch (c)
{
case 0:
b += '0';
break;
case 3:
b += '3';
break;
case 1:
b += '1';
break;
case 2:
b += '2';
break;
default:
throw "UNKOWN!";
}
}
return b;
}
uint32_t revrs(int size, const uint32_t v)
{
uint32_t bin = 0;
// Convert to binary
for (int i = 0; i < size; i++)
{
uint32_t shl = 0, shr = 0, q;
shl = (3 << (i << 1));
shr = i << 1;
q = v & shl;
q = q >> shr;
unsigned char c = static_cast<char>(q);
shl = (size - i - 1) << 1;
bin = bin | (c << shl);
}
return bin;
}
bool ckrev(std::string s1, std::string s2)
{
std::reverse(s1.begin(), s1.end());
return s1 == s2;
}
int main(int argc, char* argv[])
{
// Binary representation of base-4 number
uint32_t binr;
std::vector<std::string> chk { "123", "2230131" };
for (const auto &s : chk)
{
std::string b, r;
uint32_t c;
binr = tobin(s);
b = tostr(s.size(), binr);
c = revrs(s.size(), binr);
r = tostr(s.size(), c);
std::cout << "orig " << s << std::endl;
std::cout << "binr " << std::hex << binr << " string " << b << std::endl;
std::cout << "revs " << std::hex << c << " string " << r << std::endl;
std::cout << ">>> CHK " << (s == b) << " " << ckrev(r, b) << std::endl;
}
return 0;
}

This is a little challenging with SSE because there is little provision for bit packing (you want to take two bits from every character and pack them contiguously). Anyway, the special instruction _mm_movemask_epi8 can help you.
For the string-to-binary conversion, you can proceed as follows:
load the 16 characters string (pad with zeroes or clear after the load if necessary);
subtract bytewise ASCII zeroes .
compare bytewise 'unsigned greater than' to a string of 16 '3' bytes; this will set bytes 0xFF wherever there is an invalid character
use _mm_movemask_epi8 to detect such a character in the packed short value
If all is fine, you now need to pack the bit pairs. For this you need to
duplicate the 16 bytes
shift the bits of weight 1 and 2, left by 7 or 6 positions, to make them most significant (_mm_sll_epi16. There is no epi8 version, but bits from one element becoming garbage in the low bits of another element isn't important for this.)
interleave them (_mm_unpack..._epi8, once with lo and once with hi)
store the high bits of those two vectors into shorts with _mm_movemask_epi8.
For the binary-to-string conversion, I can't think of an SSE implementation that makes sense, as there is no counterpart of _mm_movemask_epi8 that would allow you to unpack efficiently.

I'll solve the problem of converting 32-bit integer to base4 string on SSE.
The problem of removing leading zeros is not considered, i.e. base4 strings always have length 16.
General throughts
Clearly, we have to extract pairs of bits in vectorized form.
In order to do it, we can perform some byte manipulations and bitwise operations.
Let's see what we can do with SSE:
A single intrinsic _mm_shuffle_epi8 (from SSSE3) allows to shuffle 16 bytes in absolutely any way you desire.
Clearly, some well-structured shuffles and register mixtures can be done with simpler instructions from SSE2,
but it's important to remember that any in-register shuffling can be done with one cheap instruction.
Shuffling does not help to change indices of bits in a byte.
In order to move chunks of bits around, we usually use bit shifts.
Unfortunately, there is no way in SSE to shift different elements of XMM register by different amounts.
As #PeterCorder mentioned in comments, there are such instructions in AVX2 (e.g. _mm_sllv_epi32), but they operate on at least 32-bit granularity.
From the ancient times we are constantly taught that bit shift is fast and multiplication is slow. Today arithmetic is so much accelerated, that it is no longer so. In SSE, shifts and multiplications seem to have equal throughput, although multiplications have more latency.
Using multiplication by powers of two we can shift left different elements of single XMM register by different amounts. There are many instructions like _mm_mulhi_epi16, which allow 16-bit granularity. Also one instruction _mm_maddubs_epi16 allows 8-bit granularity of shifts.
Right shift can be done via left shift just the same way people do division via multiplication: shift left by 16-k, then shift right by two bytes (recall that any byte shuffling is cheap).
We actually want to do 16 different bit shifts. If we use multiplication with 16-bit granularity, then we'll have to use at least two XMM registers for shifting, then they can be merged together. Also, we can try to use multiplication with 8-bit granularity to do everything in a single register.
16-bit granularity
First of all, we have to move 32-bit integer to the lower 4 bytes of XMM register. Then we shuffle bytes so that each 16-bit part of XMM register contains one byte of input:
|abcd|0000|0000|0000| before shuffle (little-endian)
|a0a0|b0b0|c0c0|d0d0| after shuffle (to low halves)
|0a0a|0b0b|0c0c|0d0d| after shuffle (to high halves)
Then we can call _mm_mulhi_epi16 to shift each part right by k = 1..16. Actually, it is more convenient to put input bytes into high halves of 16-bit elements, so that we can shift left by k = -8..7. As a result, we want to see some bytes of XMM register containing the pairs of bits defining some base4 digits (as their lower bits). After that we can remove unnecessary high bits by _mm_and_si128, and shuffle valuable bytes to proper places.
Since only 8 shifts can be done at once with 16-bit granularity, we have to do the shifting part twice. Then we combine the two XMM registers into one.
Below you can see the code using this idea. It a bit optimized: there is no bytes shuffling after the bit shifts.
__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(-1, 0, -1, 0, -1, 1, -1, 1, -1, 2, -1, 2, -1, 3, -1, 3));
__m128i even = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x00100100)); //epi16: 1<<8, 1<<4 x4 times
__m128i odd = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x04004000)); //epi16: 1<<14, 1<<10 x4 times
even = _mm_and_si128(even, _mm_set1_epi16(0x0003));
odd = _mm_and_si128(odd , _mm_set1_epi16(0x0300));
__m128i res = _mm_xor_si128(even, odd);
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
8-bit granularity
First of all we move our 32-bit integer into XMM register of course. Then we shuffle bytes so that each byte of result equals the input byte containing the two bits wanted at that place:
|abcd|0000|0000|0000| before shuffle (little-endian)
|aaaa|bbbb|cccc|dddd| after shuffle
Now we use _mm_and_si128 to filter bits: at each byte only the two bits wanted must remain. After that we only need to shift each byte right by 0/2/4/6 bits. This should be achieved with intrinsic _mm_maddubs_epi16, which allows to shift 16 bytes at once. Unfortunately, I do not see how to shift all the bytes properly with this instruction only, but at least we can shift each odd byte by 2 bits right (even bytes remain as is). Then the bytes with indices 4k+2 and 4k+3 can be shifted right by 4 bits with single _mm_madd_epi16 instruction.
Here is the resulting code:
__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3));
__m128i twobits = _mm_and_si128(bytes, _mm_set1_epi32(0xC0300C03)); //epi8: 3<<0, 3<<2, 3<<4, 3<<6 x4 times
twobits = _mm_maddubs_epi16(twobits, _mm_set1_epi16(0x4001)); //epi8: 1<<0, 1<<6 x8 times
__m128i res = _mm_madd_epi16(twobits, _mm_set1_epi32(0x10000001)); //epi16: 1<<0, 1<<12 x4 times
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
P.S.
Both solutions use a lot of compile-time constant 128-bit values. They are not encoded into x86 instructions, so processor has to load them from memory (most likely L1 cache) each time they are used. However, if you are going to run many conversions in a loop, then the compiler would load all these constants into registers before the loop (I hope).
Here you can find the full code (without timing), including implementation of the str2bin solution by #YvesDaoust.

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?

The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
uint32_t bitpack(const bool array[32]) {
__mm256i tmp = _mm256_loadu_si256((const __mm256i *) array);
tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) = 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary and should save another cycle or so.

If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here in a computer with fast multiplication like this
inline int pack8b(bool* a)
{
uint64_t t = *((uint64_t*)a);
return (0x8040201008040201*t >> 56) & 0xFF;
}
int pack32b(bool* a)
{
return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
(pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and load them we'll get the bits in reversed order in a little-endian machine. Now we'll do a multiplication (here dots are zero bits)
| a7 || a6 || a4 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point 8 least significant bits has been put in the top byte, we'll just need to mask the remaining bits out
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, like shift only once instead of shifting left 56 bits
Sorry I overlooked the question and saw doynax's bool array as well as misread "32 0/1 values" and thought they're 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distribution of bits) at the same time but it's a lot less efficient than packing bytes
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);

Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register is constant, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts; it does 32 1-bit shifts which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (sum on 32) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See #Jongware's measurements of differences in comments; apparantly long shifts on x86 are not unit time).
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1<<m+i2)
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a2[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[1]=(a4[4]<<4)+a4[5];
a8[1]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1]);
a16[1]=(a8[2]<<8)+a8[3]);
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register is constant. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. Its pretty hard to avoid the fetches (if the original booleans are scattered around) and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in array, with each bit occupying the bottom bit of and otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int)a1)[1];
...
a4[7]=((unsigned int)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a8[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[0]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and wierd instrucions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions needed to performance tested.

I would probably go for this:
unsigned a[32] =
{
1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
unsigned b = 0;
for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
b |= a[i] << i;
printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}

To determine what the fastest way is, time all of the various suggestions. Here is one that well may end up as "the" fastest (using standard C, no processor dependent SSE or the likes):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (i=0; i< 32; i++)
b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing proof-of-concept with some rough timings show this is indeed not magnitudes better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)

unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}

Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1} ;
int decimal = 0 ;
for(int i = 0 ; i < 8 ; i++)
decimal = (decimal << 1) + array[i] ;
Actually I have to convert 64 bit binary array to decimal and I have to do it for million times.
Can anybody help me, is there any faster way to do the above ? Or is the above one is nice ?

Your method is adequate, to call it nice I would just not mix bitwise operations and "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];

It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Further more it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.

The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding was big endian, then *(unsigned long long)array % 254 would produce a weighted sum (with truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Other mechanism to get the weight is to premultiply each binary digit by 255 and masking the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from the answer of JasonD, parallel bit selection can be iteratively utilized.
But first expressing the equation in full form would help the compiler to remove the artificial dependency created by the iterative approach using accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
HI=*(uint32_t)array, LO=*(uint32_t)&array[4];
LO |= (HI<<4); // The HI dword has a weight 16 relative to Lo bytes
LO |= (LO>>14); // High word has 4x weight compared to low word
LO |= (LO>>9); // high byte has 2x weight compared to lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.

You could use accumulate, with a doubling and adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
return (sum * 2) + next;
}
int decimal = accumulate(array, array+ARRAY_SIZE,
doubleSumAndAdd);
This produces big-endian integers, whereas OP code produces little-endian.

Try this, I converted a binary digit of up to 1020 bits
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;
long binary_decimal(string num) /* Function to convert binary to dec */
{
long dec = 0, n = 1, exp = 0;
string bin = num;
if(bin.length() > 1020){
cout << "Binary Digit too large" << endl;
}
else {
for(int i = bin.length() - 1; i > -1; i--)
{
n = pow(2,exp++);
if(bin.at(i) == '1')
dec += n;
}
}
return dec;
}
Theoretically this method will work for a binary digit of infinate length

Algorithm for finding the smallest power of two that's greater or equal to a given value [duplicate]

I need to find the smallest power of two that's greater or equal to a given value. So far, I have this:
int value = 3221; // 3221 is just an example, could be any number
int result = 1;
while (result < value) result <<= 1;
It works fine, but feels kind of naive. Is there a better algorithm for that problem?
EDIT. There were some nice assembler suggestions, so I'm adding those tags to the question.
A related question, Rounding up to next power of 2 has some C answers where C++20 std::bit_ceil() isn't available.
Most of the answers to this question predate C++20, but could still be useful if implementing a C++ standard library or compiler.

Here's my favorite. Other than the initial check for whether it's invalid (<0, which you could skip if you knew you'd only have >=0 numbers passed in), it has no loops or conditionals, and thus will outperform most other methods. This is similar to erickson's answer, but I think that my decrementing x at the beginning and adding 1 at the end is a little less awkward than his answer (and also avoids the conditional at the end).
/// Round up to next higher power of 2 (return x if it's already a power
/// of 2).
inline int
pow2roundup (int x)
{
if (x < 0)
return 0;
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return x+1;
}

ceil(log2(value))
ilog2() can be calculated in 3 asm instructions e.g., http://www.asterisk.org/doxygen/1.4/log2comp_8h-source.html

On Intel hardware the BSR instruction is close to what you want - it finds the most-significant-set-bit. If you need to be more precise you can then wonder if the remaining bits are precisely zero or not.
I tend to assume that other CPU's will have something like BSR - this is a question you want answered to normalize a number.
If your number is more than 32 bits then you would do a scan from your most-significant-DWORD to find the first DWORD with ANY bits set.
Edsger Dijkstra would likely remark that the above "algorithms" assume that your computer uses Binary Digits, while from his kind of lofty "algorithmic" perspective you should think about Turing machines or something - obviously I am of the more pragmatic style.

In the spirit of Quake II's 0x5f3759df and the Bit Twiddling Hacks' IEEE version - this solution reaches into a double to extract the exponent as a means to calculate floor(lg2(n)). It's a bit faster than the accepted solution and much faster than the Bit Twiddling IEEE version since it avoids floating point math. As coded, it assumes a double is a real*8 IEEE float on a little endian machine.
int nextPow2(int n)
{
if ( n <= 1 ) return n;
double d = n-1;
return 1 << ((((int*)&d)[1]>>20)-1022);
}
Edit: Add optimized x86 assembly version with the help of a co-worker. A 4% speed gain but still about 50% slower than a bsr version (6 sec vs 4 on my laptop for n=1..2^31-2).
int nextPow2(int n)
{
if ( n <= 1 ) return n;
double d;
n--;
__asm {
fild n
mov eax,4
fstp d
mov ecx, dword ptr d[eax]
sar ecx,14h
rol eax,cl
}
}

Here's a template version of the bit shifting technique.
template<typename T> T next_power2(T value)
{
--value;
for(size_t i = 1; i < sizeof(T) * CHAR_BIT; i*=2)
value |= value >> i;
return value+1;
}
Since the loop only uses constants it gets flattened by the compiler. (I checked) The function is also future proof.
Here's one that uses __builtin_clz. (Also future proof)
template<typename T> T next_power2(T value)
{
return 1 << ((sizeof(T) * CHAR_BIT) - __builtin_clz(value-1));
}

Your implementation is not naive, it's actually the logical one, except that it's wrong - it returns a negative for numbers greater that 1/2 the maximum integer size.
Assuming you can restrict numbers to the range 0 through 2^30 (for 32-bit ints), it'll work just fine, and a lot faster than any mathematical functions involving logarithms.
Unsigned ints would work better but you'd end up with an infinite loop (for numbers greater than 2^31) since you can never reach 2^32 with the << operator.

pow ( 2 , ceil( log2(value) );
log2(value) = log(value) / log(2);

An exploration of the possible solutions to closely related problem (that is, rounding down instead of up), many of which are significantly faster than the naive approach, is available on the Bit Twiddling Hacks page, an excellent resource for doing the kinds of optimization you are looking for. The fastest solution is to use a lookup table with 256 entries, that reduces the total operation count to around 7, from an average of 62 (by a similar operation counting methodology) for the naive approach. Adapting those solutions to your problem is a matter of a single comparison and increment.

You don't really say what you mean by "better algorithm" but as the one you present is perfectly clear (if somewhat flawed), I'll assume you are after a more efficient algorithm.
Larry Gritz has given what is probably the most efficient c/c++ algorithm without the overhead of a look up table and it would suffice in most cases (see http://www.hackersdelight.org for similar algorithms).
As mentioned elsewhere most CPUs these days have machine instructions to count the number of leading zeroes (or equivalently return the ms set bit) however their use is non-portable and - in most cases - not worth the effort.
However most compilers have "intrinsic" functions that allow the use of machine instructions but in a more portable way.
Microsoft C++ has _BitScanReverse() and gcc provides __builtin_clz() to do the bulk of the work efficiently.

How about a recursive template version to generate a compile constant:
template<uint32_t A, uint8_t B = 16>
struct Pow2RoundDown { enum{ value = Pow2RoundDown<(A | (A >> B)), B/2>::value }; };
template<uint32_t A>
struct Pow2RoundDown<A, 1> { enum{ value = (A | (A >> 1)) - ((A | (A >> 1)) >> 1) }; };
template<uint32_t A, uint8_t B = 16>
struct Pow2RoundUp { enum{ value = Pow2RoundUp<((B == 16 ? (A-1) : A) | ((B == 16 ? (A-1) : A) >> B)), B/2>::value }; };
template<uint32_t A >
struct Pow2RoundUp<A, 1> { enum{ value = ((A | (A >> 1)) + 1) }; };
Can be used like so:
Pow2RoundDown<3221>::value, Pow2RoundUp<3221>::value

The code below repeatedly strips the lowest bit off until the number is a power of two, then doubles the result unless the number is a power of two to begin with. It has the advantage of running in a time proportional to the number of bits set. Unfortunately, it has the disadvantage of requiring more instructions in almost all cases than either the code in the question or the assembly suggestions. I include it only for completeness.
int nextPow(int x) {
int y = x
while (x &= (x^(~x+1)))
y = x << 1;
return y
}

I know this is downvote-bait, but if the number is small enough (like 8 or 16-bits) a direct lookup might be fastest.
// fill in the table
unsigned short tab[65536];
unsigned short bit = tab[i];
It might be possible to extend it to 32 bits by first doing the high word and then the low.
//
unsigned long bitHigh = ((unsigned long)tab[(unsigned short)(i >> 16)]) << 16;
unsigned long bitLow = 0;
if (bitHigh == 0){
bitLow = tab[(unsigned short)(i & 0xffff)];
}
unsigned long answer = bitHigh | bitLow;
It's probably no better that the shift-or methods, but maybe could be extended to larger word sizes.
(Actually, this gives the highest 1-bit. You'd have to shift it left by 1 to get the next higher power of 2.)

My version of the same:
int pwr2Test(size_t x) {
return (x & (x - 1))? 0 : 1;
}
size_t pwr2Floor(size_t x) {
// A lookup table for rounding up 4 bit numbers to
// the nearest power of 2.
static const unsigned char pwr2lut[] = {
0x00, 0x01, 0x02, 0x02, // 0, 1, 2, 3
0x04, 0x04, 0x04, 0x04, // 4, 5, 6, 7
0x08, 0x08, 0x08, 0x08, // 8, 9, 10, 11
0x08, 0x08, 0x08, 0x08 // 12, 13, 14, 15
};
size_t pwr2 = 0; // The return value
unsigned int i = 0; // The nybble interator
for( i = 0; x != 0; ++i ) { // Iterate through nybbles
pwr2 = pwr2lut[x & 0x0f]; // rounding up to powers of 2.
x >>= 4; // (i - 1) will contain the
} // highest non-zero nybble index.
i = i? (i - 1) : i;
pwr2 <<= (i * 4);
return pwr2;
}
size_t pwr2Size(size_t x) {
if( pwr2Test(x) ) { return x; }
return pwr2Floor(x) * 2;
}

In standard c++20 this is included in <bit>.
The answer is simply
#include <bit>
unsigned long upper_power_of_two(unsigned long v)
{
return std::bit_ceil(v);
}

i love the shift.
i'll settle for
int bufferPow = 1;
while ( bufferPow<bufferSize && bufferPow>0) bufferPow <<= 1;
that way the loop always terminates, and the part after && is evaluated almost never.
And i do not think two lines are worth a function call. Also you can make a long, or short, depending on your judgment, and it is very readable.
(if bufferPow becomes negative, hopefully your main code will exit fast.)
Usually you compute 2-power only once at the start of an algorithm, so optimizing would be silly anyway. However, would be interested if anyone bored enough would care for a speed contest... using the above examples and 255 256 257 .. 4195 4196 4197

An arbitrary log function can be converted to a log base 2 by dividing by the log of 2:
$ /usr/local/pypy-1.9/bin/pypy
Python 2.7.2 (341e1e3821ff, Jun 07 2012, 15:38:48)
[PyPy 1.9.0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``<arigato> yes but there is not
much sense if I explain all about today's greatest idea if tomorrow it's
completely outdated''
>>>> import math
>>>> print math.log(65535)/math.log(2)
15.9999779861
>>>> print math.log(65536)/math.log(2)
16.0
>>>>
It of course won't be 100% precise, since there is floating point arithmetic involved.

This works and is really fast (on my 2.66 GHz Intel Core 2 Duo 64-bit processor).
#include <iostream>
int main(void) {
int testinput,counter;
std::cin >> testinput;
while (testinput > 1) {
testinput = testinput >> 1;
counter++;
}
int finalnum = testinput << counter+1;
printf("Is %i\n",finalnum);
return 0;
}
I tested it on 3, 6, and 65496, and correct answers (4, 8, and 65536) were given.
Sorry if this seems a bit arcane, I was under the influence of a couple of hours of Doom just before writing. :)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js