Is there are a fast (O(1)) way to find the position of the set bit in a one-hot binary sequence? [duplicate] - bit-manipulation

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!

Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming

Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.

There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are too complex (in terms of branch prediction, cache misses, pipelining) that it's so hard to predict which code is executed faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Optimized code on a single architecture might result in worse code in the other. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler to choose what it thinks it's better.

Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of achitecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.

Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}

The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest-byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the least amount of instructions, but the least amount of branches which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256-bytes of memory, but if the speed of this function is so important then that 256-bytes is well worth it,
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}

Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think your going to save a few iterations by splitting the value in to smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Lets consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush, however if you are performing this operation a number of times it is likely to only occur once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware with no Cache Miss, no Branch Miss-Prediction and No other programmer generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(ui value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.

Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).

After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20

It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithms's 1 check though to 2 checks if you use this grouping idea. But that extra 1 check in best case is worth it for the worst case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}

Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...

Another method (modulus division and lookup) deserves a special mention here from the same link provided by #anton-tykhyy. this method is very similar in performance to DeBruijn multiply and lookup method with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
modulus division and lookup method returns different values for v=0x00000000 and v=FFFFFFFF whereas DeBruijn multiply and lookup method returns zero on both inputs.
test:-
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */

According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note than if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)
Here is a 64-bit subtract and xor:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58];

Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for n-bit number. [with log(n) extra space]. Typical solutions checking for the set bit is either O(n) or need O(n) extra space for a look up table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For 64 bit number, log2(64)-1, ie; 5 masks needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}

You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.

See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.

Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = x & 0x55555555 + (x >> 1) & 0x55555555;
// 0x00000000 0x0000002a 0x00001aaa
x = x & 0x33333333 + (x >> 2) & 0x33333333;
// 0x00000000 0x00000024 0x00001444
x = x & 0x0f0f0f0f + (x >> 4) & 0x0f0f0f0f;
// 0x00000000 0x00000006 0x00000508
x = x & 0x00ff00ff + (x >> 8) & 0x00ff00ff;
// 0x00000000 0x00000006 0x0000000d
x = x & 0x0000ffff + (x >> 16) & 0x0000ffff;
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13

If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.

This is in regards of #Anton Tykhyy answer
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
but if the compiler can't or won't preprocess the call it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");

Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1

unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people that are complaining on how inefficient the worst case scenario for this code don't understand how rare that condition will happen.

recently I see that singapore's premier posted a program he wrote on facebook, there is one line to mention it..
The logic is simply "value & -value", suppose you have 0x0FF0, then,
0FF0 & (F00F+1) , which equals 0x0010, that means the lowest 1 is in the 4th bit.. :)

If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

Related

How does GCC optimize this `switch`

Today I discovered that GCC does some amazing magic for optimizing switches in this code:
StairsType GetStairsType(uint8_t tileId, uint8_t dlvl)
{
if (dlvl == 0)
return StairsType::Part;
if (tileId == 48) {
return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
}
switch (tileId) {
case 57: return StairsType::Down;
// many more tile IDs go here
}
}
Have a look at this 🪄:
https://godbolt.org/z/snY3jv8Wz
(gcc is on the left, clang is on the right)
Somehow GCC manages to compile this to 1/4 of the code compared to Clang.
What does this asm do, conceptually?
How does GCC do this?
Both compilers compile this part in the obvious way:
if (dlvl == 0)
return StairsType::Part;
When it comes to this part:
if (tileId == 48) {
// This ID is used by both Caves and Crypt.
return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
}
gcc checks tileId==48 directly, while clang decides to merge it into the switch statement.
Both compilers decide to calculate dlvl >= 21 ? 1 : 4 branchlessly in more-or-less the same way, but with different exact instructions. GCC sneakily uses the carry flag to calculate ((dlvl >= 21 ? 0 : -1) & 3) + 1 while clang straightforwardly calculates ((dlvl < 21) * 3) + 1.
// GCC
cmp sil, 21
sbb eax, eax
and eax, 3
add eax, 1
// clang
xor eax, eax
cmp sil, 21
setb al
lea eax, [rax + 2*rax]
inc eax
When it comes to the switch statement, clang implements it with a jump table. The lowest entry is 16 and the highest is 160, so it subtracts 16 and then checks whether the number is greater than 144. There's a well-known trick to save one check, where numbers smaller than 16 wrap around to very big numbers, so they're greater than 144. For whatever reason, it chooses to preload the number 1 (StairsType::Down) into the return value register before doing the jump.
Meanwhile on gcc, it first checks if the tile ID is between 16 and 78. If so, it subtracts 16 and checks some bitmasks:
if(tileID < 16) {
return 0;
} else if(tileID <= 78) {
mask = (1 << (tileID - 16));
if(2307251517974380578 & mask) return 2;
if(4611688220747366401 & mask) return 1;
return ((421920768 & mask) != 0) << 2;
} else {
tileID += 125; // The same as tileID -= 131
// because it's being treated as a byte
if(tileID > 29) { // values lower than 131 wrapped around
return 0;
}
// notice that if we get here it means the tileID is >= 131 and <= 160
// << only looks at the bottom 5 bits of tileID (i.e. tileID & 31)
return -((541065279 & (1 << tileID)) != 0) & 3;
}
Let's try that again, but with the bitmasks in binary and let's figure out what those return statements return:
if(tileID < 16) {
return StairsType::Invalid;
} else if(tileID <= 78) {
mask = (1 << (tileID - 16));
// tile 17, 21, 35, 36, 38, 51, 56, 64, 66, 77
if(0010000000000101000000010000100000000000010110000000000000100010 & mask) return StairsType::Up;
// tile 16, 39, 40, 46, 47, 57, 78
if(0100000000000000000000100000000011000100100000000000000000000001 & mask) return StairsType::Down;
// tile 33, 34, 37, 40, 43, 44
return ((00011001001001100000000000000000 & mask) != 0) ? StairsType::Part : StairsType::Invalid;
} else {
if(tileID < 131 || tileID > 160) {
return 0;
}
// tile 131, 132, 133, 134, 135, 136, 153, 160
return (00100000010000000000000000111111 & (1 << (tileID - 131))) ? StairsType::ShortcutToTown : StairsType::Invalid;
}
Seems that the compiler noticed that you grouped your tile IDs into somewhat logical groups. E.g. above 130 there are only shortcuts to town.
I have no idea how compiler writers come up with this stuff.

How to write `a >>= std::countr_zero(a);` in C? [duplicate]

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are too complex (in terms of branch prediction, cache misses, pipelining) that it's so hard to predict which code is executed faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Optimized code on a single architecture might result in worse code in the other. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler to choose what it thinks it's better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of achitecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest-byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the least amount of instructions, but the least amount of branches which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256-bytes of memory, but if the speed of this function is so important then that 256-bytes is well worth it,
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think your going to save a few iterations by splitting the value in to smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Lets consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush, however if you are performing this operation a number of times it is likely to only occur once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware with no Cache Miss, no Branch Miss-Prediction and No other programmer generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(ui value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithms's 1 check though to 2 checks if you use this grouping idea. But that extra 1 check in best case is worth it for the worst case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
Another method (modulus division and lookup) deserves a special mention here from the same link provided by #anton-tykhyy. this method is very similar in performance to DeBruijn multiply and lookup method with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
modulus division and lookup method returns different values for v=0x00000000 and v=FFFFFFFF whereas DeBruijn multiply and lookup method returns zero on both inputs.
test:-
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note than if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)
Here is a 64-bit subtract and xor:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58];
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for n-bit number. [with log(n) extra space]. Typical solutions checking for the set bit is either O(n) or need O(n) extra space for a look up table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For 64 bit number, log2(64)-1, ie; 5 masks needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = x & 0x55555555 + (x >> 1) & 0x55555555;
// 0x00000000 0x0000002a 0x00001aaa
x = x & 0x33333333 + (x >> 2) & 0x33333333;
// 0x00000000 0x00000024 0x00001444
x = x & 0x0f0f0f0f + (x >> 4) & 0x0f0f0f0f;
// 0x00000000 0x00000006 0x00000508
x = x & 0x00ff00ff + (x >> 8) & 0x00ff00ff;
// 0x00000000 0x00000006 0x0000000d
x = x & 0x0000ffff + (x >> 16) & 0x0000ffff;
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regards of #Anton Tykhyy answer
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
but if the compiler can't or won't preprocess the call it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people that are complaining on how inefficient the worst case scenario for this code don't understand how rare that condition will happen.
recently I see that singapore's premier posted a program he wrote on facebook, there is one line to mention it..
The logic is simply "value & -value", suppose you have 0x0FF0, then,
0FF0 & (F00F+1) , which equals 0x0010, that means the lowest 1 is in the 4th bit.. :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

How can I simulate binary values using a vector of booleans in C++?

I want to be able to retain the same amount of bits to my vector whilst still performing binary addition. For example.
int numOfBits = 4;
int myVecVal = 3;
vector< bool > myVec;
GetBinaryVector(&myVec,myVecVal, numOfBits);
and its output would be:
{0, 0, 1, 1}
I don't know how to make a function of GetBinaryVector though, any ideas?
This seems to work (although the article I added in initial comment seem to suggest you only have byte level access):
void GetBinaryVector(vector<bool> *v, int val, int bits) {
v->resize(bits);
for(int i = 0; i < bits; i++) {
(*v)[bits - 1 - i] = (val >> i) & 0x1;
}
}
The left hand side sets the i'th least significant bit which is index bits - 1 - i. The right hand side isolates the i'th least significant bit by bit shifting the value down i'th bit and masking everything but the least significant bit.
In your example val = 8, bits = 15. In the first iteration i = 0: we have (*v)[15 - 1 - 0] = (8 >> 0) & 0x1. 8 is binary 1000 and shifting it down 0 is 1000. 1000 & 0x1 is 0. Let's jump to i = 4: (*v)[15 - 1 - 4] = (8 >> 4) & 0x1. 1000 >> 4 is 1 and 1 & 0x1 is 1, so we set (*v)[10] = 1. The resulting vector is { 0, ..., 0, 1, 0, 0, 0 }

A number as it's prime number parts

I have to print the number of ways you can represent a given number as it's prime number parts.
Let me clarify: Let's say I have been given this number 7. Now, first of all, I have to find all the prime numbers that are less than 7, which are 2, 3 and 5. Now, in how many ways can I summarize those numbers (I can use one number as many times I want) so that the result equals 7? For example, number 7 has five ways:
2 + 2 + 3
2 + 3 + 2
2 + 5
3 + 2 + 2
5 + 2
I'm totally lost with this task. First I figured I'd make an array of usable elements like so: { 2, 2, 2, 3, 3, 5 } (7/2 = 3, so 2 must appear three times. Same goes with 3, which gets two occurences). After that, loop through the array and choose a 'leader' that determines how far in the array we are. I know the explanation is horrible, so here's the code:
#include <iostream>
#include <vector>
int primes_all[25] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97};
int main()
{
int number;
std::cin >> number;
std::vector<int> primes_used;
for(int i = 0; i < 25; i++) {
if(primes_all[i] < number && number-primes_all[i] > 1) {
for(int k = 0; k < number/primes_all[i]; k++)
primes_used.push_back(primes_all[i]);
}
else break;
}
int result = 0;
for(size_t i = 0; i < primes_used.size(); i++) {
int j = primes_used.size()-1;
int new_num = number - primes_used[i];
while(new_num > 1 && j > -1)
{
if(j > -1) while(primes_used[j] > new_num && j > 0) j--;
if(j != i && j > -1) {
new_num -= primes_used[j];
std::cout << primes_used[i] << " " << primes_used[j] << " " << new_num << std::endl;
}
j--;
}
if(new_num == 0) result++;
}
std::cout << result << std::endl;
system("pause");
return 0;
}
This simply doesn't work. Simply because the idea behind it is wrong. Here's a little details about the limits:
Time limit: 1 second
Memory limit: 128 MB
Also, the biggest number that can be given is 100. That's why I made the array of prime numbers below 100. The result grows very fast as the given number gets bigger, and will need a BigInteger class later on, but that's not an issue.
A few results known:
Input Result
7 5
20 732
80 10343662267187
SO... Any ideas? Is this a combinatory problem? I don't need code, just an idea. I'm still a newbie to C++ but I'll manage
Keep in mind that 3 + 2 + 2 is different than 2 + 3 + 2.
Also, were the given number to be a prime itself, it won't be counted. For example, if the given number is 7, only these sums are valid:
2 + 2 + 3
2 + 3 + 2
2 + 5
3 + 2 + 2
5 + 2
7 <= excluded
Dynamic programming is your friend here.
Consider the number 27.
If 7 has 5 results, and 20 has 732 results, then you know that 27 has at least (732 * 5) results. You can use a two variable system (1 + 26, 2 + 25 ... etc) using the precomputed values for those as you go. You don't have to recompute 25 or 26 because you already did them.
The concept you are searching for is the "prime partitions" of a number. S partition of a number is a way of adding numbers to reach the target; for instance, 1+1+2+3 is a partition of 7. If all the addends are prime, then the partition is a prime partition.
I think your example is wrong. The number 7 is usually considered to have 3 prime partitions: 2+2+3, 2+5, and 7. The order of the addends doesn't matter. In number theory the function that counts prime partitions is kappa, so we would say kappa(7) = 3.
The usual calculation of kappa is done in two parts. The first part is a function to compute the sum of the prime factors of a number; for instance, 42=2·3·7, so sopf(42)=12. Note that sopf(12)=5 because the sum is over only the distinct factors of a number, so even though 12=2·2·3, only one 2 is included in the calculation of the sum.
Given sopf, there is a lengthy formula to calculate kappa; I'll give it in LaTeX form, since I don't know how to enter it here: \kappa(n) = \frac{1}{n}\left(\mathrm{sopf}(n) + \sum_{j=1}^{n-1} \mathrm{sopf}(j) \cdot \kappa(n-j)\right).
If you actually want a list of the partitions, instead of just the count, there is a dynamic programming solution that #corsiKa pointed out.
I discuss prime partitions in more detail at my blog, including source code to produce both the count and the list.
Here's an efficient implementation which uses dynamic programming like corsiKa suggests, but does not use the algorithm he describes.
Simply: if n is reachable via k distinct paths (including the single-step one, if it exists), and p is prime, then we construct k paths to n+p by appending p to all paths to n. Considering all such n < N will produce an exhaustive list of valid paths to N. So we just sum the number of paths so discovered.
#include <iostream>
int primes_all[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97};
const int N_max = 85;
typedef long long ways;
ways ways_to_reach_N[N_max + 1] = { 1 };
int main()
{
// find all paths
for( int i = 0; i <= N_max; ++i ) {
ways ways_to_reach_i = ways_to_reach_N[i];
if (ways_to_reach_i) {
for( int* p = primes_all; *p <= N_max - i && p < (&primes_all)[1]; ++p ) {
ways_to_reach_N[i + *p] += ways_to_reach_i;
}
}
}
// eliminate single-step paths
for( int* p = primes_all; *p <= N_max && p < (&primes_all)[1]; ++p ) {
--ways_to_reach_N[*p];
}
// print results
for( int i = 1; i <= N_max; ++i ) {
ways ways_to_reach_i = ways_to_reach_N[i];
if (ways_to_reach_i) {
std::cout << i << " -- " << ways_to_reach_i << std::endl;
}
}
return 0;
}
Demo: http://ideone.com/xWZT8v
Replacing the typedef ways with a big integer type is left as an exercise to the reader.

Position of least significant bit that is set

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are too complex (in terms of branch prediction, cache misses, pipelining) that it's so hard to predict which code is executed faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Optimized code on a single architecture might result in worse code in the other. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler to choose what it thinks it's better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of achitecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest-byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the least amount of instructions, but the least amount of branches which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256-bytes of memory, but if the speed of this function is so important then that 256-bytes is well worth it,
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think your going to save a few iterations by splitting the value in to smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Lets consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush, however if you are performing this operation a number of times it is likely to only occur once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware with no Cache Miss, no Branch Miss-Prediction and No other programmer generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(ui value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithms's 1 check though to 2 checks if you use this grouping idea. But that extra 1 check in best case is worth it for the worst case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
Another method (modulus division and lookup) deserves a special mention here from the same link provided by #anton-tykhyy. this method is very similar in performance to DeBruijn multiply and lookup method with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
modulus division and lookup method returns different values for v=0x00000000 and v=FFFFFFFF whereas DeBruijn multiply and lookup method returns zero on both inputs.
test:-
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note than if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)
Here is a 64-bit subtract and xor:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58];
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for n-bit number. [with log(n) extra space]. Typical solutions checking for the set bit is either O(n) or need O(n) extra space for a look up table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For 64 bit number, log2(64)-1, ie; 5 masks needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = x & 0x55555555 + (x >> 1) & 0x55555555;
// 0x00000000 0x0000002a 0x00001aaa
x = x & 0x33333333 + (x >> 2) & 0x33333333;
// 0x00000000 0x00000024 0x00001444
x = x & 0x0f0f0f0f + (x >> 4) & 0x0f0f0f0f;
// 0x00000000 0x00000006 0x00000508
x = x & 0x00ff00ff + (x >> 8) & 0x00ff00ff;
// 0x00000000 0x00000006 0x0000000d
x = x & 0x0000ffff + (x >> 16) & 0x0000ffff;
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regards of #Anton Tykhyy answer
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
but if the compiler can't or won't preprocess the call it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people that are complaining on how inefficient the worst case scenario for this code don't understand how rare that condition will happen.
recently I see that singapore's premier posted a program he wrote on facebook, there is one line to mention it..
The logic is simply "value & -value", suppose you have 0x0FF0, then,
0FF0 & (F00F+1) , which equals 0x0010, that means the lowest 1 is in the 4th bit.. :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.