Related
I have an array of 100 uint8_t's, which is to be treated as a stream of 800 bits, and dealt with 7 bits at a time. So in other words, if the first element of the 8-bit array holds 0b11001100 and the second holds ob11110000 then when I come to read it in 7-bit format, the first element of the 7-bit array would be 0b1100110 and the second would be 0b0111100 with the remaining 2 bits being held in the 3rd.
The first thing I tried was a union...
struct uint7_t {
uint8_t i1:7;
};
union uint7_8_t {
uint8_t u8[100];
uint7_t u7[115];
};
but of course everything's byte aligned and I essentially end up simply loosing the 8th bit of each element.
Does anyone have any idea's on how I can go about doing this?
Just to be clear, this is something of a visual representation of the result of the union:
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 32 bits of 8 bit data
0xxxxxxx 0xxxxxxx 0xxxxxxx 0xxxxxxx 32 bits of 7-bit data.
And this represents what it is that I want to do instead:
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx 32 bits of 8 bit data
xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxx 32 bits of 7-bit data.
I'm aware the last bits may be padded but that's fine, I just want someway of accessing each byte 7 bits at a time without losing any of the 800 bits. So far the only way I can think of is lots of bit shifting, which of course would work but I'm sure there's a cleaner way of going about it(?)
Thanks in advance for any answers.
Not sure what you mean by "cleaner". Generally people who work on this sort of problem regularly consider shifting and masking to be the right primitive tool to use. One can do something like defining a bitstream abstraction with a method to read an arbitrary number of bits off the stream. This abstraction sometimes shows up in compression applications. The internals of the method of course do use shifting and masking.
One fairly clean approach is to write a function which extracts a 7-bit number at any bit index in an array of unsigned char's. Use a division to convert the bit index to a byte index, and modulus to get the bit index within the byte. Then shift and mask. The input bits can span two bytes, so you either have to glue together a 16-bit value before extraction, or do two smaller extractions and or them together to construct the result.
If I were aiming for something moderately performant, I'd likely take one of two approaches:
The first has two state variables saying how many bits to take from the current and next byte. It would use shifting, masking, and bitwise or, to produce the current output (a number between 0 and 127 as an int for example), then the loop would update both state variables via adding and modulus, and would increment the current byte pointers if all bits in the first byte were consumed.
The second approach is to load 56-bits (8 outputs worth of input) into a 64-bit integer and use a fully unrolled structure to extract each of the 8 outputs. Doing this without using unaligned memory reads requires constructing the 64-bit integer piecemeal. (56-bits is special because the starting bit position is byte aligned.)
To go real fast, I might try writing SIMD code in Halide. That's beyond scope here I believe. (And not clear it is going to win much actually.)
Designs which read more than one byte into a integer at a time will likely have to consider processor byte ordering.
Process them in groups of 8 (since 8x7 nicely rounds to something 8bit aligned). Bitwise operators are the order of the day here. Faffing around with the last (upto) 7 numbers is a little faffy, but not impossible. (This code assumes these are unsigned 7 bit integers! Signed conversion would require you to do consider flipping the top bit if bit[6] is 1)
// convert 8 x 7bit ints in one go
void extract8(const uint8_t input[7], uint8_t output[8])
{
output[0] = input[0] & 0x7F;
output[1] = (input[0] >> 7) | ((input[1] << 1) & 0x7F);
output[2] = (input[1] >> 6) | ((input[2] << 2) & 0x7F);
output[3] = (input[2] >> 5) | ((input[3] << 3) & 0x7F);
output[4] = (input[3] >> 4) | ((input[4] << 4) & 0x7F);
output[5] = (input[4] >> 3) | ((input[5] << 5) & 0x7F);
output[6] = (input[5] >> 2) | ((input[6] << 6) & 0x7F);
output[7] = input[6] >> 1;
}
// convert array of 7bit ints to 8bit
void seven_bit_to_8bit(const uint8_t* const input, uint8_t* const output, const size_t count)
{
size_t count8 = count >> 3;
for(size_t i = 0; i < count8; ++i)
{
extract8(input + 7 * i, output + 8 * i);
}
// handle remaining (upto) 7 bytes
const size_t countr = (count % 8);
if(countr)
{
// how many bytes do we need to copy from the input?
size_t remaining_bits = 7 * countr;
if(remaining_bits % 8)
{
// round to next nearest multiple of 8
remaining_bits += (8 - remaining_bits % 8);
}
remaining_bits /= 8;
{
uint8_t in[7] = {0}, out[8] = {0};
for(size_t i = 0; i < remaining_bits; ++i)
{
in[i] = input[count8 * 7 + i];
}
extract8(in, out);
for(size_t i = 0; i < countr; ++i)
{
output[count8 * 8 + i] = in[i];
}
}
}
}
Here is a solution that uses the vector bool specialization. It also uses a similar mechanism to allow access to the seven-bit elements via reference objects.
The member functions allow for the following operations:
uint7_t x{5}; // simple value
Arr<uint7_t> arr(10); // array of size 10
arr[0] = x; // set element
uint7_t y = arr[0]; // get element
arr.push_back(uint7_t{9}); // add element
arr.push_back(x); //
std::cout << "Array size is "
<< arr.size() << '\n'; // get size
for(auto&& i : arr)
std::cout << i << '\n'; // range-for to read values
int z{50};
for(auto&& i : arr)
i = z++; // range-for to change values
auto&& v = arr[1]; // get reference to second element
v = 99; // change second element via reference
Full program:
#include <vector>
#include <iterator>
#include <iostream>
struct uint7_t {
unsigned int i : 7;
};
struct seven_bit_ref {
size_t begin;
size_t end;
std::vector<bool>& bits;
seven_bit_ref& operator=(const uint7_t& right)
{
auto it{bits.begin()+begin};
for(int mask{1}; mask != 1 << 7; mask <<= 1)
*it++ = right.i & mask;
return *this;
}
operator uint7_t() const
{
uint7_t r{};
auto it{bits.begin() + begin};
for(int i{}; i < 7; ++i)
r.i += *it++ << i;
return r;
}
seven_bit_ref operator*()
{
return *this;
}
void operator++()
{
begin += 7;
end += 7;
}
bool operator!=(const seven_bit_ref& right)
{
return !(begin == right.begin && end == right.end);
}
seven_bit_ref operator=(int val)
{
uint7_t temp{};
temp.i = val;
operator=(temp);
return *this;
}
};
template<typename T>
class Arr;
template<>
class Arr<uint7_t> {
public:
Arr(size_t size) : bits(size * 7, false) {}
seven_bit_ref operator[](size_t index)
{
return {index * 7, index * 7 + 7, bits};
}
size_t size()
{
return bits.size() / 7;
}
void push_back(uint7_t val)
{
for(int mask{1}; mask != 1 << 7; mask <<= 1){
bits.push_back(val.i & mask);
}
}
seven_bit_ref begin()
{
return {0, 7, bits};
}
seven_bit_ref end()
{
return {size() * 7, size() * 7 + 7, bits};
}
std::vector<bool> bits;
};
std::ostream& operator<<(std::ostream& os, uint7_t val)
{
os << val.i;
return os;
}
int main()
{
uint7_t x{5}; // simple value
Arr<uint7_t> arr(10); // array of size 10
arr[0] = x; // set element
uint7_t y = arr[0]; // get element
arr.push_back(uint7_t{9}); // add element
arr.push_back(x); //
std::cout << "Array size is "
<< arr.size() << '\n'; // get size
for(auto&& i : arr)
std::cout << i << '\n'; // range-for to read values
int z{50};
for(auto&& i : arr)
i = z++; // range-for to change values
auto&& v = arr[1]; // get reference
v = 99; // change via reference
std::cout << "\nAfter changes:\n";
for(auto&& i : arr)
std::cout << i << '\n';
}
The following code works as you have asked for it, but first the output and live example on ideone.
Output:
Before changing values...:
7 bit representation: 1111111 0000000 0000000 0000000 0000000 0000000 0000000 0000000
8 bit representation: 11111110 00000000 00000000 00000000 00000000 00000000 00000000
After changing values...:
7 bit representation: 1000000 1001100 1110010 1011010 1010100 0000111 1111110 0000000
8 bit representation: 10000001 00110011 10010101 10101010 10000001 11111111 00000000
8 Bits: 11111111 to ulong: 255
7 Bits: 1111110 to ulong: 126
After changing values...:
7 bit representation: 0010000 0101010 0100000 0000000 0000000 0000000 0000000 0000000
8 bit representation: 00100000 10101001 00000000 00000000 00000000 00000000 00000000
It is very straight forward using a std::bitset in a class called BitVector. I implement one getter and setter. The getter returns also a std::bitset at the given index selIdx with a given template argument size M. The given idx will be multiplied by the given size M to get the right position. The returned bitset can also be converted to numerical or string values.
The setter uses an uint8_t value as input and again the index selIdx. The bits will be shifted to the right position into the bitset.
Further you can use the getter and setter with different sizes because of the template argument M, which means you can work with either 7 or 8 bit representation but also 3 or what ever you like.
I'm sure this code is not the best concerning speed, but I think it is a very clear and clean solution. Also it is not complete at all as there are just one getter, one setter and two constructors. Remember to implement error checking concerning indexes and sizes.
Code:
#include <iostream>
#include <bitset>
template <size_t N> class BitVector
{
private:
std::bitset<N> _data;
public:
BitVector (unsigned long num) : _data (num) { };
BitVector (const std::string& str) : _data (str) { };
template <size_t M>
std::bitset<M> getBits (size_t selIdx)
{
std::bitset<M> retBitset;
for (size_t idx = 0; idx < M; ++idx)
{
retBitset |= (_data[M * selIdx + idx] << (M - 1 - idx));
}
return retBitset;
}
template <size_t M>
void setBits (size_t selIdx, uint8_t num)
{
const unsigned char* curByte = reinterpret_cast<const unsigned char*> (&num);
for (size_t bitIdx = 0; bitIdx < 8; ++bitIdx)
{
bool bitSet = (1 == ((*curByte & (1 << (8 - 1 - bitIdx))) >> (8 - 1 - bitIdx)));
_data.set(M * selIdx + bitIdx, bitSet);
}
}
void print_7_8()
{
std:: cout << "\n7 bit representation: ";
for (size_t idx = 0; idx < (N / 7); ++idx)
{
std::cout << getBits<7>(idx) << " ";
}
std:: cout << "\n8 bit representation: ";
for (size_t idx = 0; idx < N / 8; ++idx)
{
std::cout << getBits<8>(idx) << " ";
}
}
};
int main ()
{
BitVector<56> num = 127;
std::cout << "Before changing values...:";
num.print_7_8();
num.setBits<8>(0, 0x81);
num.setBits<8>(1, 0b00110011);
num.setBits<8>(2, 0b10010101);
num.setBits<8>(3, 0xAA);
num.setBits<8>(4, 0x81);
num.setBits<8>(5, 0xFF);
num.setBits<8>(6, 0x00);
std::cout << "\n\nAfter changing values...:";
num.print_7_8();
std::cout << "\n\n8 Bits: " << num.getBits<8>(5) << " to ulong: " << num.getBits<8>(5).to_ulong();
std::cout << "\n7 Bits: " << num.getBits<7>(6) << " to ulong: " << num.getBits<7>(6).to_ulong();
num = BitVector<56>(std::string("1001010100000100"));
std::cout << "\n\nAfter changing values...:";
num.print_7_8();
return 0;
}
Here is one approach without the manual shifting. This is just a crude POC, but hopefully you will be able to get something out of it. I don't know if you are able to easily transform your input into bitset, but i think it should be possible.
int bytes = 0x01234567;
bitset<32> bs(bytes);
cout << "Input: " << bs << endl;
for(int i = 0; i < 5; i++)
{
bitset<7> slice(bs.to_string().substr(i*7, 7));
cout << slice << endl;
}
Also this is probably much less performant then the bitshifting version, so i wouldn't recommend it for heavy lifting.
You can use this to get the index'th 7-bit element from in (note that it doesn't have proper end of array handling). Simple, fast.
int get7(const uint8_t *in, int index) {
int fidx = index*7;
int idx = fidx>>3;
int sidx = fidx&7;
return (in[idx]>>sidx|in[idx+1]<<(8-sidx))&0x7f;
}
You can use direct access or bulk bit packing/unpacking as in TurboPFor:Integer Compression
// Direct read access
// b : bit width 0-16 (7 in your case)
#define bzhi32(u,b) ((u) & ((1u <<(b))-1))
static inline unsigned bitgetx16(unsigned char *in,
unsigned idx,
unsigned b) {
unsigned bidx = b*idx;
return bzhi32( *(unsigned *)((uint16_t *)in+(bidx>>4)) >> (bidx& 0xf), b );
}
If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).
GCC has:
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.
A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.
To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)
There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).
(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)
Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.
Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.
This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD
This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}
Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.
The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.
Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.
Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));
unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit
What about
int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
?
Here are some (simple) benchmarks, of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
time_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}
Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.
I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}
A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.
( 4 loops when using "unsigned int")
thats some kind of binary search, it works with all kinds of (unsigned!) integer types
#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
to make complete:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}
Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.
As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h> /* for uint32_t */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}
I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1The argument goes like this (rough draft):
Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:
(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n
This series of average iterations is actually convergent and has a limit of 2 for n towards infinity
Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.
c99 has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const auto n = 13UL;
const auto Index = (unsigned long)log2(n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
The visual-studio corollary to ephemient's gcc only answer is:
const auto n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409
Think bitwise operators.
I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:
position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}
Woaw, that was many answers. I am not sorry for answering on an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.
Note that what you are trying to do is calculate the integer log2 of an integer,
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.
I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.
This looks big but works really fast compared to loop thank from bluegsmith
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}
There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so sure it is an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).
The code:
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1
My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.
Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a little improved version for C++.
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.
Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name
cycles
comment
clz
5.16
builtin intrinsic, fastest
cast
5.18
cast to double, extract exp
ulog2
7.50
reduction + deBrujin
msb64*
11.26
this version
unrolled
19.12
varying performance
obvious
110.49
"obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.
If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).
GCC has:
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.
A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.
To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)
There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).
(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)
Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.
Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.
This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD
This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}
Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.
The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.
Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.
Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));
unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit
What about
int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
?
Here are some (simple) benchmarks, of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
time_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}
Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.
I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}
A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.
( 4 loops when using "unsigned int")
thats some kind of binary search, it works with all kinds of (unsigned!) integer types
#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
to make complete:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}
Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.
As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h> /* for uint32_t */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}
I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1The argument goes like this (rough draft):
Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:
(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n
This series of average iterations is actually convergent and has a limit of 2 for n towards infinity
Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.
c99 has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const auto n = 13UL;
const auto Index = (unsigned long)log2(n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
The visual-studio corollary to ephemient's gcc only answer is:
const auto n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409
Think bitwise operators.
I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:
position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}
Woaw, that was many answers. I am not sorry for answering on an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.
Note that what you are trying to do is calculate the integer log2 of an integer,
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.
I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.
This looks big but works really fast compared to loop thank from bluegsmith
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}
There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so sure it is an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).
The code:
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1
My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.
Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a little improved version for C++.
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.
Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name
cycles
comment
clz
5.16
builtin intrinsic, fastest
cast
5.18
cast to double, extract exp
ulog2
7.50
reduction + deBrujin
msb64*
11.26
this version
unrolled
19.12
varying performance
obvious
110.49
"obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.
I have an character array (say char charr[5]) which contains 0/1 (char array of boolean number). Now, I want to convert the character array to 64 bit integer number (if array is {0, 0, 0, 1 , 0}, it will give 2 ). How to do that ? Is there any library functions ?
No, there's no standard function for that. But it's pretty trivial:
uint64_t pack(const uint8_t *bits, size_t n)
{
uint64_t x = 0, value = 1 << (n - 1);
while(n > 0)
{
x += value * *bits++;
n--;
value /= 2;
}
return x;
}
Unwind has the basic idea right, but a complex implementation. This also works:
uint64_t pack(const uint8_t *bits, size_t n)
{
uint64 x = 0;
for(;n > 0; n--) // For all input bits.
{
x <<= 1; // make room for next bit.
assert(*bits <= 1); // It better be a 0 or 1.
x += *bits++; // Add new bit on the end.
}
return x;
}
Try strtoll with base 2:
int val = strtoll(input, NULL, 2);
Lets say i have an array dynamically allocated.
int* array=new int[10]
That is 10*4=40 bytes or 10*32=320 bits. I want to read the 2nd bit of the 30th byte or 242nd bit. What is the easiest way to do so? I know I can access the 30th byte using array[30] but accessing individual bits is more tricky.
bool bitset(void const * data, int bitindex) {
int byte = bitindex / 8;
int bit = bitindex % 8;
unsigned char const * u = (unsigned char const *) data;
return (u[byte] & (1<<bit)) != 0;
}
this is working !
#define GET_BIT(p, n) ((((unsigned char *)p)[n/8] >> (n%8)) & 0x01)
int main()
{
int myArray[2] = { 0xaaaaaaaa, 0x00ff00ff };
for( int i =0 ; i < 2*32 ; i++ )
printf("%d", GET_BIT(myArray, i));
return 0;
}
ouput :
0101010101010101010101010101010111111111000000001111111100000000
Be carefull of the endiannes !
First, if you're doing bitwise operations, it's usually
preferable to make the elements an unsigned integral type
(although in this case, it really doesn't make that much
difference). As for accessing the bits: to access bit i in an
array of n int's:
static int const bitsPerWord = sizeof(int) * CHAR_BIT;
assert( i >= 0 && i < n * bitsPerWord );
int wordIndex = i / bitsPerWord;
int bitIndex = i % bitsPerWord;
then to read:
return (array[wordIndex] & (1 << bitIndex)) != 0;
to set:
array[wordIndex] |= 1 << bitIndex;
and to reset:
array[wordIndex] &= ~(1 << bitIndex);
Or you can use bitset, if n is constant, or vector<bool> or
boost::dynamic_bitset if it's not, and let someone else do the
work.
You can use something like this:
!((array[30] & 2) == 0)
array[30] is the integer.
& 2 is an and operation which masks the second bit (2 = 00000010)
== 0 will check if the mask result is 0
! will negate that result, because we're checking if it's 1 not zero....
You need bit operations here...
if(array[5] & 0x1)
{
//the first bit in array[5] is 1
}
else
{
//the first bit is 0
}
if(array[5] & 0x8)
{
//the 4th bit in array[5] is 1
}
else
{
//the 4th bit is 0
}
0x8 is 00001000 in binary. Doing the anding masks all other bits and allows you to see if the bit is 1 or 0.
int is typically 32 bits, so you would need to do some arithmetic to get a certain bit number in the entire array.
EDITED based on comment below - array contains int of 32 bits, not 8 bits uchar.
int pos = 241; // I start at index 0
bool bit242 = (array[pos/32] >> (pos%32)) & 1;