Fast way to determine right most nth bit set in a 64 bit

Fast way to determine right most nth bit set in a 64 bit - c++

I try to determine the right most nth bit set
if (value & (1 << 0)) { return 0; }
if (value & (1 << 1)) { return 1; }
if (value & (1 << 2)) { return 2; }
...
if (value & (1 << 63)) { return 63; }
if comparison needs to be done 64 times. Is there any faster way?

If you're using GCC, use the __builtin_ctz or __builtin_ffs function. (http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fffs-2894)
If you're using MSVC, use the _BitScanForward function. See How to use MSVC intrinsics to get the equivalent of this GCC code?.
In POSIX there's also a ffs function. (http://linux.die.net/man/3/ffs)

There's a little trick for this:
value & -value
This uses the twos' complement integer representation of negative numbers.
Edit: This doesn't quite give the exact result as given in the question. The rest can be done with a small lookup table.

You could use a loop:
unsigned int value;
unsigned int temp_value;
const unsigned int BITS_IN_INT = sizeof(int) / CHAR_BIT;
unsigned int index = 0;
// Make a copy of the value, to alter.
temp_value = value;
for (index = 0; index < BITS_IN_INT; ++index)
{
if (temp_value & 1)
{
break;
}
temp_value >>= 1;
}
return index;
This takes up less code space than the if statement proposal, with similar functionality.

KennyTM's suggestions are good if your compiler supports them. Otherwise, you can speed it up using a binary search, something like:
int result = 0;
if (!(value & 0xffffffff)) {
result += 32;
value >>= 32;
}
if (!(value & 0xffff)) {
result += 16;
value >>= 16;
}
and so on. This will do 6 comparisons (in general, log(N) comparisons, versus N for a linear search).

b = n & (-n) // finds the bit
b -= 1; // this gives 1's to the right
b--; // this gets us just the trailing 1's that need counting
b = (b & 0x5555555555555555) + ((b>>1) & 0x5555555555555555); // 2 bit sums of 1 bit numbers
b = (b & 0x3333333333333333) + ((b>>2) & 0x3333333333333333); // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f0f0f0f0f) + ((b>>4) & 0x0f0f0f0f0f0f0f0f); // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff00ff00ff) + ((b>>8) & 0x00ff00ff00ff00ff); // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff0000ffff) + ((b>>16) & 0x0000ffff0000ffff); // 32 bit sums of 16 bit numbers
b = (b & 0x00000000ffffffff) + ((b>>32) & 0x00000000ffffffff); // sum of 32 bit numbers
b &= 63; // otherwise I think an input of 0 would produce 64 for a result.
This is in C of course.

Here's another method that takes advantage of short-circuit with logical AND operations and conditional instruction execution or the instruction pipeline.
unsigned int value;
unsigned int temp_value = value;
bool bit_found = false;
unsigned int index = 0;
bit_found = !bit_found && ((temp_value & (1 << index++)); // bit 0
bit_found = !bit_found && ((temp_value & (1 << index++)); // bit 1
bit_found = !bit_found && ((temp_value & (1 << index++)); // bit 2
bit_found = !bit_found && ((temp_value & (1 << index++)); // bit 3
//...
bit_found = !bit_found && ((temp_value & (1 << index++)); // bit 64
return index - 1; // The -1 may not be necessary depending on the starting bit number.
The advantage to this method is that there are no branches and the instruction pipeline is not disturbed. This is very fast on processors that perform conditional execution of instructions.

Works for Visual C++ 6
int toErrorCodeBit(__int64 value) {
const int low_double_word = value;
int result = 0;
__asm
{
bsf eax, low_double_word
jz low_double_value_0
mov result, eax
}
return result;
low_double_value_0:
const int upper_double_word = value >> 32;
__asm
{
bsf eax, upper_double_word
mov result, eax
}
result += 32;
return result;
}

Related

Bit-twiddle / bit-operation hacks: Count number of places to MSB in C or C++ [duplicate]

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).

GCC has:
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.
A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.
To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)
There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).
(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD

This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}

Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.
The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.
Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));

unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

What about
int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
?

Here are some (simple) benchmarks, of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
time_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}

A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.
( 4 loops when using "unsigned int")

thats some kind of binary search, it works with all kinds of (unsigned!) integer types
#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
to make complete:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}

Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h> /* for uint32_t */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}

I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1The argument goes like this (rough draft):
Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:
(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n
This series of average iterations is actually convergent and has a limit of 2 for n towards infinity
Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

c99 has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const auto n = 13UL;
const auto Index = (unsigned long)log2(n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
The visual-studio corollary to ephemient's gcc only answer is:
const auto n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409

Think bitwise operators.
I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:
position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}

Woaw, that was many answers. I am not sorry for answering on an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.

Note that what you are trying to do is calculate the integer log2 of an integer,
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}

Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.

This looks big but works really fast compared to loop thank from bluegsmith
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}

There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so sure it is an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).

The code:
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.

Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a little improved version for C++.
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.

Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name
cycles
comment
clz
5.16
builtin intrinsic, fastest
cast
5.18
cast to double, extract exp
ulog2
7.50
reduction + deBrujin
msb64*
11.26
this version
unrolled
19.12
varying performance
obvious
110.49
"obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.

Gettin last non zero bit on the left [duplicate]

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD

This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}

Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.
The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.
Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));

unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

What about
int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
?

Here are some (simple) benchmarks, of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
time_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}

A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.
( 4 loops when using "unsigned int")

thats some kind of binary search, it works with all kinds of (unsigned!) integer types
#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
to make complete:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}

Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h> /* for uint32_t */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}

I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1The argument goes like this (rough draft):
Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:
(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n
This series of average iterations is actually convergent and has a limit of 2 for n towards infinity
Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

c99 has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const auto n = 13UL;
const auto Index = (unsigned long)log2(n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
The visual-studio corollary to ephemient's gcc only answer is:
const auto n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409

Think bitwise operators.
I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:
position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}

Woaw, that was many answers. I am not sorry for answering on an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.

Note that what you are trying to do is calculate the integer log2 of an integer,
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}

Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.

This looks big but works really fast compared to loop thank from bluegsmith
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}

There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so sure it is an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).

The code:
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.

Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a little improved version for C++.
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.

Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name
cycles
comment
clz
5.16
builtin intrinsic, fastest
cast
5.18
cast to double, extract exp
ulog2
7.50
reduction + deBrujin
msb64*
11.26
this version
unrolled
19.12
varying performance
obvious
110.49
"obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.

turning off leftmost non zero bit of a number

how can I turn off leftmost non-zero bit of a number in O(1)?
for example
n = 366 (base 10) = 101101110 (in base 2)
then after turning the leftmost non-zero bit off ,number looks like = 001101110
n will always be >0

Well, if you insist on O(1) under any circumstances, the Intel Intrinsics function _bit_scan_reverse() defined in immintrin.h does a hardware find for the most-significant non-zero bit in a int number.
Though the operation does use a loop (functional equivalent), I believe its constant time given its latency at fixed 3 (as per Intel Intrinsics Guide).
The function will return the index to the most-significant non-zero bit thus doing a simple:
n = n & ~(1 << _bit_scan_reverse(n));
should do.
This intrinsic is undefined for n == 0. So you gotta watch out there. I'm following the assumption of your original post where n > 0.

n = 2^x + y.
x = log(n) base 2
Your highest set bit is x.
So in order to reset that bit,
number &= ~(1 << x);
Another approach:
int highestOneBit(int i) {
i |= (i >> 1);
i |= (i >> 2);
i |= (i >> 4);
i |= (i >> 8);
i |= (i >> 16);
return i - (i >> 1);
}
int main() {
int n = 32767;
int z = highestOneBit(n); // returns the highest set bit number i.e 2^x.
cout<< (n&(~z)); // Resets the highest set bit.
return 0;
}

Check out this question, for a possibly faster solution, using a processor instruction.
However, an O(lgN) solution is:
int cmsb(int x)
{
unsigned int count = 0;
while (x >>= 1) {
++count;
}
return x & ~(1 << count);
}

If ANDN is not supported and LZCNT is supported, the fastest O(1) way to do it is not something along the lines of n = n & ~(1 << _bit_scan_reverse(n)); but rather...
int reset_highest_set_bit(int x)
{
const int mask = 0x7FFFFFFF; // 011111111[...]
return x & (mask >> __builtin_clz(x));
}

Extract n most significant non-zero bits from int in C++ without loops

I want to extract the n most significant bits from an integer in C++ and convert those n bits to an integer.
For example
int a=1200;
// its binary representation within 32 bit word-size is
// 00000000000000000000010010110000
Now I want to extract the 4 most significant digits from that representation, i.e. 1111
00000000000000000000010010110000
^^^^
and convert them again to an integer (1001 in decimal = 9).
How is possible with a simple c++ function without loops?

Some processors have an instruction to count the leading binary zeros of an integer, and some compilers have instrinsics to allow you to use that instruction. For example, using GCC:
uint32_t significant_bits(uint32_t value, unsigned bits) {
unsigned leading_zeros = __builtin_clz(value);
unsigned highest_bit = 32 - leading_zeros;
unsigned lowest_bit = highest_bit - bits;
return value >> lowest_bit;
}
For simplicity, I left out checks that the requested number of bits are available. For Microsoft's compiler, the intrinsic is called __lzcnt.
If your compiler doesn't provide that intrinsic, and you processor doesn't have a suitable instruction, then one way to count the zeros quickly is with a binary search:
unsigned leading_zeros(int32_t value) {
unsigned count = 0;
if ((value & 0xffff0000u) == 0) {
count += 16;
value <<= 16;
}
if ((value & 0xff000000u) == 0) {
count += 8;
value <<= 8;
}
if ((value & 0xf0000000u) == 0) {
count += 4;
value <<= 4;
}
if ((value & 0xc0000000u) == 0) {
count += 2;
value <<= 2;
}
if ((value & 0x80000000u) == 0) {
count += 1;
}
return count;
}

It's not fast, but (int)(log(x)/log(2) + .5) + 1 will tell you the position of the most significant non-zero bit. Finishing the algorithm from there is fairly straight-forward.

This seems to work (done in C# with UInt32 then ported so apologies to Bjarne):
unsigned int input = 1200;
unsigned int most_significant_bits_to_get = 4;
// shift + or the msb over all the lower bits
unsigned int m1 = input | input >> 8 | input >> 16 | input >> 24;
unsigned int m2 = m1 | m1 >> 2 | m1 >> 4 | m1 >> 6;
unsigned int m3 = m2 | m2 >> 1;
unsigned int nbitsmask = m3 ^ m3 >> most_significant_bits_to_get;
unsigned int v = nbitsmask;
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -((int)v);
if (v>0) c--;
if ((v & 0x0000FFFF) >0) c -= 16;
if ((v & 0x00FF00FF) >0) c -= 8;
if ((v & 0x0F0F0F0F) >0 ) c -= 4;
if ((v & 0x33333333) >0) c -= 2;
if ((v & 0x55555555) >0) c -= 1;
unsigned int result = (input & nbitsmask) >> c;
I assumed you meant using only integer math.
I used some code from #OliCharlesworth's link, you could remove the conditionals too by using the LUT for trailing zeroes code there.

How to determine how many bytes an integer needs?

I'm looking for the most efficient way to calculate the minimum number of bytes needed to store an integer without losing precision.
e.g.
int: 10 = 1 byte
int: 257 = 2 bytes;
int: 18446744073709551615 (UINT64_MAX) = 8 bytes;
Thanks
P.S. This is for a hash functions which will be called many millions of times
Also the byte sizes don't have to be a power of two
The fastest solution seems to one based on tronics answer:
int bytes;
if (hash <= UINT32_MAX)
{
if (hash < 16777216U)
{
if (hash <= UINT16_MAX)
{
if (hash <= UINT8_MAX) bytes = 1;
else bytes = 2;
}
else bytes = 3;
}
else bytes = 4;
}
else if (hash <= UINT64_MAX)
{
if (hash < 72057594000000000ULL)
{
if (hash < 281474976710656ULL)
{
if (hash < 1099511627776ULL) bytes = 5;
else bytes = 6;
}
else bytes = 7;
}
else bytes = 8;
}
The speed difference using mostly 56 bit vals was minimal (but measurable) compared to Thomas Pornin answer. Also i didn't test the solution using __builtin_clzl which could be comparable.

Use this:
int n = 0;
while (x != 0) {
x >>= 8;
n ++;
}
This assumes that x contains your (positive) value.
Note that zero will be declared encodable as no byte at all. Also, most variable-size encodings need some length field or terminator to know where encoding stops in a file or stream (usually, when you encode an integer and mind about size, then there is more than one integer in your encoded object).

You need just two simple ifs if you are interested on the common sizes only. Consider this (assuming that you actually have unsigned values):
if (val < 0x10000) {
if (val < 0x100) // 8 bit
else // 16 bit
} else {
if (val < 0x100000000L) // 32 bit
else // 64 bit
}
Should you need to test for other sizes, choosing a middle point and then doing nested tests will keep the number of tests very low in any case. However, in that case making the testing a recursive function might be a better option, to keep the code simple. A decent compiler will optimize away the recursive calls so that the resulting code is still just as fast.

Assuming a byte is 8 bits, to represent an integer x you need [log2(x) / 8] + 1 bytes where [x] = floor(x).
Ok, I see now that the byte sizes aren't necessarily a power of two. Consider the byte sizes b. The formula is still [log2(x) / b] + 1.
Now, to calculate the log, either use lookup tables (best way speed-wise) or use binary search, which is also very fast for integers.

The function to find the position of the first '1' bit from the most significant side (clz or bsr) is usually a simple CPU instruction (no need to mess with log2), so you could divide that by 8 to get the number of bytes needed. In gcc, there's __builtin_clz for this task:
#include <limits.h>
int bytes_needed(unsigned long long x) {
int bits_needed = sizeof(x)*CHAR_BIT - __builtin_clzll(x);
if (bits_needed == 0)
return 1;
else
return (bits_needed + 7) / 8;
}
(On MSVC you would use the _BitScanReverse intrinsic.)

You may first get the highest bit set, which is the same as log2(N), and then get the bytes needed by ceil(log2(N) / 8).
Here are some bit hacks for getting the position of the highest bit set, which are copied from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious, and you can click the URL for details of how these algorithms work.
Find the integer log base 2 of an integer with an 64-bit IEEE float
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
Find the log base 2 of an integer with a lookup table
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
unsigned int v; // 32-bit word to find the log of
unsigned r; // r will be lg(v)
register unsigned int t, tt; // temporaries
if (tt = v >> 16)
{
r = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
}
else
{
r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
}
Find the log base 2 of an N-bit integer in O(lg(N)) operations
unsigned int v; // 32-bit value to find the log2 of
const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};
const unsigned int S[] = {1, 2, 4, 8, 16};
int i;
register unsigned int r = 0; // result of log2(v) will go here
for (i = 4; i >= 0; i--) // unroll for speed...
{
if (v & b[i])
{
v >>= S[i];
r |= S[i];
}
}
// OR (IF YOUR CPU BRANCHES SLOWLY):
unsigned int v; // 32-bit value to find the log2 of
register unsigned int r; // result of log2(v) will go here
register unsigned int shift;
r = (v > 0xFFFF) << 4; v >>= r;
shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
r |= (v >> 1);
// OR (IF YOU KNOW v IS A POWER OF 2):
unsigned int v; // 32-bit value to find the log2 of
static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
0xFF00FF00, 0xFFFF0000};
register unsigned int r = (v & b[0]) != 0;
for (i = 4; i > 0; i--) // unroll for speed...
{
r |= ((v & b[i]) != 0) << i;
}

Find the number of bits by taking the log2 of the number, then divide that by 8 to get the number of bytes.
You can find logn of x by the formula:
logn(x) = log(x) / log(n)
Update:
Since you need to do this really quickly, Bit Twiddling Hacks has several methods for quickly calculating log2(x). The look-up table approach seems like it would suit your needs.

This will get you the number of bytes. It's not strictly the most efficient, but unless you're programming a nanobot powered by the energy contained in a red blood cell, it won't matter.
int count = 0;
while (numbertotest > 0)
{
numbertotest >>= 8;
count++;
}

You could write a little template meta-programming code to figure it out at compile time if you need it for array sizes:
template<unsigned long long N> struct NBytes
{ static const size_t value = NBytes<N/256>::value+1; };
template<> struct NBytes<0>
{ static const size_t value = 0; };
int main()
{
std::cout << "short = " << NBytes<SHRT_MAX>::value << " bytes\n";
std::cout << "int = " << NBytes<INT_MAX>::value << " bytes\n";
std::cout << "long long = " << NBytes<ULLONG_MAX>::value << " bytes\n";
std::cout << "10 = " << NBytes<10>::value << " bytes\n";
std::cout << "257 = " << NBytes<257>::value << " bytes\n";
return 0;
}
output:
short = 2 bytes
int = 4 bytes
long long = 8 bytes
10 = 1 bytes
257 = 2 bytes
Note: I know this isn't answering the original question, but it answers a related question that people will be searching for when they land on this page.

Floor((log2(N) / 8) + 1) bytes

You need exactly the log function
nb_bytes = floor(log(x)/log(256))+1
if you use log2, log2(256) == 8 so
floor(log2(x)/8)+1

You need to raise 256 to successive powers until the result is larger than your value.
For example: (Tested in C#)
long long limit = 1;
int byteCount;
for (byteCount = 1; byteCount < 8; byteCount++) {
limit *= 256;
if (limit > value)
break;
}
If you only want byte sizes to be powers of two (If you don't want 65,537 to return 3), replace byteCount++ with byteCount *= 2.

I think this is a portable implementation of the straightforward formula:
#include <limits.h>
#include <math.h>
#include <stdio.h>
int main(void) {
int i;
unsigned int values[] = {10, 257, 67898, 140000, INT_MAX, INT_MIN};
for ( i = 0; i < sizeof(values)/sizeof(values[0]); ++i) {
printf("%d needs %.0f bytes\n",
values[i],
1.0 + floor(log(values[i]) / (M_LN2 * CHAR_BIT))
);
}
return 0;
}
Output:
10 needs 1 bytes
257 needs 2 bytes
67898 needs 3 bytes
140000 needs 3 bytes
2147483647 needs 4 bytes
-2147483648 needs 4 bytes
Whether and how much the lack of speed and the need to link floating point libraries depends on your needs.

I know this question didn't ask for this type of answer but for those looking for a solution using the smallest number of characters, this does the assignment to a length variable in 17 characters, or 25 including the declaration of the length variable.
//Assuming v is the value that is being counted...
int l=0;
for(;v>>l*8;l++);

This is based on SoapBox's idea of creating a solution that contains no jumps, branches etc... Unfortunately his solution was not quite correct. I have adopted the spirit and here's a 32bit version, the 64bit checks can be applied easily if desired.
The function returns number of bytes required to store the given integer.
unsigned short getBytesNeeded(unsigned int value)
{
unsigned short c = 0; // 0 => size 1
c |= !!(value & 0xFF00); // 1 => size 2
c |= (!!(value & 0xFF0000)) << 1; // 2 => size 3
c |= (!!(value & 0xFF000000)) << 2; // 4 => size 4
static const int size_table[] = { 1, 2, 3, 3, 4, 4, 4, 4 };
return size_table[c];
}

For each of eight times, shift the int eight bits to the right and see if there are still 1-bits left. The number of times you shift before you stop is the number of bytes you need.
More succinctly, the minimum number of bytes you need is ceil(min_bits/8), where min_bits is the index (i+1) of the highest set bit.

There are a multitude of ways to do this.
Option #1.
int numBytes = 0;
do {
numBytes++;
} while (i >>= 8);
return (numBytes);
In the above example, is the number you are testing, and generally works for any processor, any size of integer.
However, it might not be the fastest. Alternatively, you can try a series of if statements ...
For a 32 bit integers
if ((upper = (value >> 16)) == 0) {
/* Bit in lower 16 bits may be set. */
if ((high = (value >> 8)) == 0) {
return (1);
}
return (2);
}
/* Bit in upper 16 bits is set */
if ((high = (upper >> 8)) == 0) {
return (3);
}
return (4);
For 64 bit integers, Another level of if statements would be required.
If the speed of this routine is as critical as you say, it might be worthwhile to do this in assembler if you want it as a function call. That could allow you to avoid creating and destroying the stack frame, saving a few extra clock cycles if it is that critical.

A bit basic, but since there will be a limited number of outputs, can you not pre-compute the breakpoints and use a case statement? No need for calculations at run-time, only a limited number of comparisons.

Why not just use a 32-bit hash?
That will work at near-top-speed everywhere.
I'm rather confused as to why a large hash would even be wanted. If a 4-byte hash works, why not just use it always? Excepting cryptographic uses, who has hash tables with more then 232 buckets anyway?

there are lots of great recipes for stuff like this over at Sean Anderson's "Bit Twiddling Hacks" page.

This code has 0 branches, which could be faster on some systems. Also on some systems (GPGPU) its important for threads in the same warp to execute the same instructions. This code is always the same number of instructions no matter what the input value.
inline int get_num_bytes(unsigned long long value) // where unsigned long long is the largest integer value on this platform
{
int size = 1; // starts at 1 sot that 0 will return 1 byte
size += !!(value & 0xFF00);
size += !!(value & 0xFFFF0000);
if (sizeof(unsigned long long) > 4) // every sane compiler will optimize this out
{
size += !!(value & 0xFFFFFFFF00000000ull);
if (sizeof(unsigned long long) > 8)
{
size += !!(value & 0xFFFFFFFFFFFFFFFF0000000000000000ull);
}
}
static const int size_table[] = { 1, 2, 4, 8, 16 };
return size_table[size];
}
g++ -O3 produces the following (verifying that the ifs are optimized out):
xor %edx,%edx
test $0xff00,%edi
setne %dl
xor %eax,%eax
test $0xffff0000,%edi
setne %al
lea 0x1(%rdx,%rax,1),%eax
movabs $0xffffffff00000000,%rdx
test %rdx,%rdi
setne %dl
lea (%rdx,%rax,1),%rax
and $0xf,%eax
mov _ZZ13get_num_bytesyE10size_table(,%rax,4),%eax
retq

Why so complicated? Here's what I came up with:
bytesNeeded = (numBits/8)+((numBits%8) != 0);
Basically numBits divided by eight + 1 if there is a remainder.

There are already a lot of answers here, but if you know the number ahead of time, in c++ you can use a template to make use of the preprocessor.
template <unsigned long long N>
struct RequiredBytes {
enum : int { value = 1 + (N > 255 ? RequiredBits<(N >> 8)>::value : 0) };
};
template <>
struct RequiredBytes<0> {
enum : int { value = 1 };
};
const int REQUIRED_BYTES_18446744073709551615 = RequiredBytes<18446744073709551615>::value; // 8
or for a bits version:
template <unsigned long long N>
struct RequiredBits {
enum : int { value = 1 + RequiredBits<(N >> 1)>::value };
};
template <>
struct RequiredBits<1> {
enum : int { value = 1 };
};
template <>
struct RequiredBits<0> {
enum : int { value = 1 };
};
const int REQUIRED_BITS_42 = RequiredBits<42>::value; // 6

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js