I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people who enjoy such things, not for people to tell me xyz optimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
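For anyone who wants to drop the snippet above into a program, here is a minimal sketch of it wrapped as a function. The name lowest_bit_debruijn and the assert on zero are my own additions for illustration, and the result is 0-based as in the question:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around the multiply-and-lookup snippet.
   Returns the 0-based index of the lowest set bit of v. */
unsigned lowest_bit_debruijn(uint32_t v)
{
    static const unsigned char table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    assert(v != 0);                      /* zero has no set bit */
    uint32_t isolated = v & (0u - v);    /* keep only the lowest set bit */
    /* the multiply shifts the de Bruijn constant left by the bit index;
       the top 5 bits of the product then identify that index uniquely */
    return table[(uint32_t)(isolated * 0x077CB531u) >> 27];
}
```

Writing the negation as 0u - v avoids the unary-minus-on-unsigned warning some compilers emit; the arithmetic is the same as v & -v.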
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
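Since ffs() counts positions from 1 and the question counts from 0, a small adapter is needed. The following is only a sketch, assuming a POSIX system where <strings.h> declares ffs(); the function name lowest_bit_ffs is made up for this example:

```c
#include <assert.h>
#include <strings.h>   /* POSIX ffs() */

/* Hypothetical adapter: ffs() is 1-based, the question wants 0-based. */
unsigned lowest_bit_ffs(unsigned value)
{
    assert(value != 0);   /* ffs() would return 0 here; handled separately */
    /* the cast to int keeps the bit pattern on two's complement machines */
    return (unsigned)ffs((int)value) - 1u;
}
```

For 0x0FF0 ffs() reports 5, so the adapter returns 4.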
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture-dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it is hard to predict which code will execute faster on which architecture. Reducing the operation count from 32 to 9, or things like that, might even decrease performance on some architectures. Code optimized for one architecture might perform worse on another. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
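To make the two identities concrete, here is a rough sketch in C. It assumes a GCC- or Clang-style compiler that provides __builtin_clz; the three function names are invented for this example:

```c
#include <assert.h>
#include <limits.h>

/* x & (x-1): clears the lowest set bit */
unsigned clear_lowest_bit(unsigned x)   { return x & (x - 1u); }

/* x & ~(x-1): keeps only the lowest set bit */
unsigned isolate_lowest_bit(unsigned x) { return x & ~(x - 1u); }

/* Emulate "lowest set bit" with a count-leading-zeroes primitive. */
unsigned lowest_bit_via_clz(unsigned x)
{
    assert(x != 0);   /* __builtin_clz(0) is undefined */
    unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    /* once only one bit is left, highest set bit == lowest set bit */
    return (width - 1u) - (unsigned)__builtin_clz(isolate_lowest_bit(x));
}
```

With x = 0x0FF0, isolate_lowest_bit gives 0x0010, which has 27 leading zeroes in a 32-bit word, so the result is 31 - 27 = 4.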
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst case of three conditional branches and a best case of one. Not only does this need very few instructions, it also needs very few branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to look up the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
typedef unsigned char byte;

byte lowestBitTable[256] = {
    .... // left as an exercise for the reader to generate
};

unsigned GetLowestBitPos(unsigned value)
{
    // note that the order in which to check the indices depends on whether
    // you are on a big- or little-endian machine. This is for little-endian
    byte* bytes = (byte*)&value; // address of value, not value itself
    if (bytes[0])
        return lowestBitTable[bytes[0]];
    else if (bytes[1])
        return lowestBitTable[bytes[1]] + 8;
    else if (bytes[2])
        return lowestBitTable[bytes[2]] + 16;
    else
        return lowestBitTable[bytes[3]] + 24;
}
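For readers who would rather not do the exercise by hand, here is one possible way to generate the table at startup. This is a sketch, not the original poster's code: the names are mine, and it extracts bytes with shifts instead of a pointer cast so that it does not depend on endianness.

```c
#include <assert.h>
#include <stdint.h>

static unsigned char lowest_bit_table[256];

/* Fill the table once: entry i holds the 0-based index of the lowest
   set bit of i. Entry 0 is never consulted and stays 0. */
void init_lowest_bit_table(void)
{
    for (unsigned i = 1; i < 256; i++) {
        unsigned char pos = 0;
        while (!((i >> pos) & 1u))
            pos++;
        lowest_bit_table[i] = pos;
    }
}

/* Find the lowest non-zero byte, then let the table finish the job. */
unsigned lowest_bit_lookup(uint32_t value)
{
    assert(value != 0);
    if (value & 0xFFu)
        return lowest_bit_table[value & 0xFFu];
    if ((value >> 8) & 0xFFu)
        return lowest_bit_table[(value >> 8) & 0xFFu] + 8u;
    if ((value >> 16) & 0xFFu)
        return lowest_bit_table[(value >> 16) & 0xFFu] + 16u;
    return lowest_bit_table[value >> 24] + 24u;
}
```

init_lowest_bit_table() has to run once before the first lookup; a precomputed static initializer would avoid that at the cost of 256 literal values in the source.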
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and costs another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded into cache from main memory. Again, this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times, it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
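As a rough illustration of how the intrinsics named above can sit behind one function, here is a sketch. The wrapper name lowest_bit_intrinsic is hypothetical, and it assumes any non-Microsoft compiler understands the GCC-style builtin:

```c
#include <assert.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* _BitScanForward */
#endif

/* Hypothetical portable wrapper; returns the 0-based index of the
   lowest set bit. */
unsigned lowest_bit_intrinsic(unsigned value)
{
    assert(value != 0);   /* both intrinsics are unspecified for 0 */
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, (unsigned long)value);
    return (unsigned)index;
#else
    return (unsigned)__builtin_ctz(value);   /* count trailing zeros */
#endif
}
```

Note that __builtin_ffs is 1-based and defined for 0, whereas __builtin_ctz is 0-based and undefined for 0, so the choice between them depends on which convention the caller wants.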
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
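If the endianness assumption or the pointer cast is a concern, the same idea can be written by copying the double into a 64-bit integer. This is a sketch under the assumption that double is an IEEE 754 binary64 value; the function name is mine, not the original poster's:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of the double hack without the int* cast. */
unsigned lowest_bit_double(uint32_t value)
{
    assert(value != 0);
    assert(sizeof(double) == sizeof(uint64_t));
    /* value ^ (value - 1) sets every bit up to and including the lowest
       set bit, so its highest bit is the one we are looking for */
    double d = (double)(value ^ (value - 1u));
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);      /* well-defined type punning */
    /* the biased exponent sits in bits 52..62; the bias is 1023 */
    return (unsigned)((bits >> 52) & 0x7FFu) - 1023u;
}
```

Because the whole double is read as one 64-bit integer, the byte order of the two halves no longer matters.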
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So, for example, there's nothing stopping you from first checking which group the bit is in, then checking each bit from smallest to biggest within that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea, but that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x00000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
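The nested version above is left elided. As a separate sketch of the same halving idea (not a completion of that code), the five tests can also be written as a flat sequence that shifts the value down as it narrows the range. The function name is hypothetical, and like the fragment above it returns a 1-based position:

```c
#include <stdint.h>

/* Hypothetical flat binary search: five tests for any non-zero input.
   Returns the 1-based position of the lowest set bit, or 0 for 0. */
int lowest_bit_binary_search(uint32_t value)
{
    if (value == 0)
        return 0;
    int pos = 1;
    if (!(value & 0x0000FFFFu)) { value >>= 16; pos += 16; }
    if (!(value & 0x000000FFu)) { value >>= 8;  pos += 8;  }
    if (!(value & 0x0000000Fu)) { value >>= 4;  pos += 4;  }
    if (!(value & 0x00000003u)) { value >>= 2;  pos += 2;  }
    if (!(value & 0x00000001u)) { pos += 1; }
    return pos;
}
```

Each test halves the range that can still contain the lowest set bit, which is why 32 bits need exactly five of them.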
Another method (modulus division and lookup), from the same link provided by @anton-tykhyy, deserves a special mention here. This method is very similar in performance to the De Bruijn multiply-and-lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v = 0x00000000 and v = 0xFFFFFFFF, whereas the De Bruijn multiply-and-lookup method returns zero for both inputs.
Test:
unsigned int n1 = 0x00000000, n2 = 0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int)((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int)((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(-n1 & n1) % 37]; /* returns 32 */
Mod37BitPosition[(-n2 & n2) % 37]; /* returns 0 */
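Here is the mod-37 lookup as a self-contained function, in case that is easier to experiment with. The function name is invented for this sketch; the table is the one quoted above. It works because the 32 powers of two all leave different remainders modulo 37, and 0 leaves remainder 0, which no power of two does:

```c
#include <stdint.h>

/* Hypothetical wrapper: number of trailing zeros in v, or 32 if v == 0. */
unsigned trailing_zeros_mod37(uint32_t v)
{
    static const unsigned char mod37_position[37] = {
        32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
        7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
        20, 8, 19, 18
    };
    /* (0u - v) & v isolates the lowest set bit (0 stays 0) */
    return mod37_position[((0u - v) & v) % 37u];
}
```

The price for the distinct zero result is an integer division, which on many CPUs is slower than the multiply used by the De Bruijn variant.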
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63, whereas negate and mask returns 0.)
Here is a 64-bit subtract and xor:
uint64_t v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
    0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
// the product must stay 64 bits wide: a 32-bit cast would discard the bits that >> 58 selects
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89ULL)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
uint64_t v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
    0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
    62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
    63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
    46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
// the product must stay 64 bits wide: a 32-bit cast would discard the bits that >> 58 selects
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89ULL)) >> 58];
I found this clever trick using 'magic masks' in "The art of programming, part 4". It runs in O(log(n)) time for an n-bit number, with log(n) extra space. Typical solutions that check for the set bit either take O(n) time or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
    if (x == 0) return -1;
    // For a 64-bit number, log2(64) = 6 masks are needed
    const int steps = 6;
    // magic masks
    const uint64_t m[] = { 0x5555555555555555, // .... 010101
                           0x3333333333333333, // .....110011
                           0x0f0f0f0f0f0f0f0f, // ...00001111
                           0x00ff00ff00ff00ff, // 0000000011111111
                           0x0000ffff0000ffff,
                           0x00000000ffffffff };
    // First extract only the lowest set bit
    uint64_t y = x & -x;
    int trailZeros = 0;
    for (int i = 0; i < steps; ++i) {
        // if the isolated bit is not covered by mask i, bit i of its index is set
        if ((y & m[i]) == 0)
            trailZeros += 1 << i; // integer power of two, no pow() needed
    }
    return trailZeros + 1; // 1-based position
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
// (the parentheses matter: + binds tighter than & in C)
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation, which does away with the casts and removes a warning on VC++17 by truncating a 64-bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
    // ~value + 1u is the two's complement negation, done in unsigned
    // arithmetic so that 0x80000000 does not overflow a signed int
    return DeBruijnSequence[
        (( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
            >> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
    return (!value) ? 32 : DeBruijnSequence[
        (( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
            >> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
if (value & 2147483648u) return 32;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people who complain about how inefficient the worst-case scenario of this code is don't understand how rarely that condition occurs.
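The percentages above follow from the fact that, for uniformly distributed inputs, half of all values have bit 0 set, a quarter have bit 1 as their lowest set bit, and so on. A quick sketch that checks this by brute force over all non-zero 16-bit values (the helper name is made up for this example):

```c
#include <assert.h>

/* Hypothetical helper: how many non-zero 16-bit values are resolved
   by the first `checks` if-statements of the unrolled function? */
unsigned resolved_within(unsigned checks)
{
    assert(checks <= 16);
    unsigned count = 0;
    for (unsigned v = 1; v < 65536u; v++) {
        unsigned pos = 0;
        while (!((v >> pos) & 1u))
            pos++;
        if (pos < checks)
            count++;
    }
    return count;
}
```

Out of 65536 values, 32768 are resolved by the first check (50%), 49152 by the first two (75%) and 63488 by the first five (about 97%). This only holds for uniformly random inputs; data such as aligned addresses or bit masks can look very different.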
I recently saw that Singapore's prime minister posted a program he wrote on Facebook, and one line in it is worth mentioning.
The logic is simply "value & -value". Suppose you have 0x0FF0; then
0FF0 & (F00F+1) equals 0x0010, which means the lowest 1 is bit 4 (counting from 0). :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.
I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are too complex (in terms of branch prediction, cache misses, pipelining) that it's so hard to predict which code is executed faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Optimized code on a single architecture might result in worse code in the other. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler to choose what it thinks it's better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of achitecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
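The two identities and the count-leading-zeroes emulation can be sketched as follows. This assumes a 32-bit unsigned int and a GCC or Clang compiler (for __builtin_clz); the function names are invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Clears the lowest set bit: 0x0FF0 -> 0x0FE0. */
uint32_t clear_lowest_bit(uint32_t x)
{
    return x & (x - 1);
}

/* Keeps only the lowest set bit: 0x0FF0 -> 0x0010. */
uint32_t isolate_lowest_bit(uint32_t x)
{
    return x & ~(x - 1);
}

/* Lowest set bit position built from a count-leading-zeroes primitive.
 * __builtin_clz is a GCC/Clang builtin and is undefined for 0, hence
 * the assert. Once only one bit is left, the highest set bit and the
 * lowest set bit are the same bit. */
int lowest_bit_pos_via_clz(uint32_t x)
{
    assert(x != 0);
    return 31 - __builtin_clz(isolate_lowest_bit(x));
}
```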
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best case of one. Not only is this the fewest instructions, it is also the fewest branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to look up the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
typedef unsigned char byte;
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that the order in which to check the bytes depends on whether you
// are on a big or little endian machine. This is for little-endian
byte* bytes = (byte*)&value; // take the address of value; casting value itself would dereference garbage
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
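The table that is left as an exercise can be generated at startup. The following is only a sketch of one way to do it, with made-up names; it also selects the bytes with shifts instead of a pointer cast, so the byte order of the machine does not matter:

```c
#include <stdint.h>

static unsigned char lowest_bit_table[256];

/* Fill the table once at startup. Entry 0 is never consulted for a
 * nonzero input, so it is simply set to 0. */
void init_lowest_bit_table(void)
{
    lowest_bit_table[0] = 0;
    for (unsigned i = 1; i < 256; i++) {
        unsigned pos = 0;
        while (!((i >> pos) & 1u))
            pos++;
        lowest_bit_table[i] = (unsigned char)pos;
    }
}

/* Zero-based position of the lowest set bit; value must be nonzero. */
unsigned lowest_bit_pos_lookup(uint32_t value)
{
    if (value & 0xFFu)
        return lowest_bit_table[value & 0xFFu];
    if ((value >> 8) & 0xFFu)
        return lowest_bit_table[(value >> 8) & 0xFFu] + 8u;
    if ((value >> 16) & 0xFFu)
        return lowest_bit_table[(value >> 16) & 0xFFu] + 16u;
    return lowest_bit_table[value >> 24] + 24u;
}
```

Call init_lowest_bit_table() once before the first lookup.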
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The number of CPU cycles lost varies widely from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds another opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from main memory into cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times, it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
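A rough sketch of how the two families of intrinsics might be wrapped behind one function. The name bit_scan_low is hypothetical, and the zero check is my own addition because neither intrinsic gives a usable result for 0:

```c
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Zero-based position of the lowest set bit, or -1 if value is 0.
 * bit_scan_low is a hypothetical wrapper name. */
int bit_scan_low(unsigned value)
{
    if (value == 0)
        return -1;               /* both intrinsics need a nonzero input */
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, value);
    return (int)index;
#else
    return __builtin_ctz(value); /* GCC and Clang */
#endif
}
```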
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
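If the byte-order assumption is a concern, the exponent can also be pulled out of the whole 64-bit pattern. This is a sketch rather than a drop-in replacement: it still assumes double is an IEEE 754 binary64 and that double and uint64_t share the same byte order, and the function name is invented:

```c
#include <stdint.h>
#include <string.h>

/* Zero-based position of the lowest set bit, or -1023 if value is 0. */
int lowest_bit_pos_double(uint32_t value)
{
    /* value ^ (value - 1) sets every bit up to and including the
     * lowest set bit; !!value keeps 0 from wrapping around. */
    double d = (double)(value ^ (value - !!value));
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* copy the bit pattern without aliasing issues */
    /* Bits 52..62 hold the biased exponent. */
    return (int)((bits >> 52) & 0x7FF) - 1023;
}
```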
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
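The elided branches follow the same pattern all the way down. One way to write the complete search without spelling out every leaf is to shift the value after each test. This is a sketch of that variant, not the nested code above; like the fragment it returns a 1-based position, and returning 0 for a zero input is my own convention:

```c
#include <stdint.h>

/* 1-based position of the lowest set bit, or 0 if v is 0.
 * Always performs exactly five mask tests for a nonzero input. */
int lowest_bit_pos_bsearch(uint32_t v)
{
    if (v == 0)
        return 0;
    int pos = 1;
    if (!(v & 0x0000FFFFu)) { v >>= 16; pos += 16; }
    if (!(v & 0x000000FFu)) { v >>= 8;  pos += 8;  }
    if (!(v & 0x0000000Fu)) { v >>= 4;  pos += 4;  }
    if (!(v & 0x00000003u)) { v >>= 2;  pos += 2;  }
    if (!(v & 0x00000001u)) { pos += 1; }
    return pos;
}
```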
Another method (modulus division and lookup) deserves a special mention here, from the same link provided by @anton-tykhyy. This method is very similar in performance to the DeBruijn multiply and lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the DeBruijn multiply and lookup method returns zero for both inputs.
Test:
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
printf("%d\n", MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* prints 0 */
printf("%d\n", MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* prints 0 */
printf("%d\n", Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* prints 32 */
printf("%d\n", Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* prints 0 */
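Wrapped into a function, the modulus method can be checked against a few known inputs. The sketch below assumes a 32-bit unsigned type; the function name is made up, while the table is the one quoted above:

```c
#include <stdint.h>

/* Maps (isolated lowest bit) mod 37 to the bit position. The powers
 * 2^0 .. 2^31 all have distinct remainders mod 37, and 0 maps to 32. */
static const int Mod37BitPosition[37] = {
    32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
    7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
    20, 8, 19, 18
};

/* Number of trailing zeros in v; returns 32 when v is 0. */
int trailing_zeros_mod37(uint32_t v)
{
    return Mod37BitPosition[(-v & v) % 37];
}
```

The distinct result for 0 is what sets this method apart from the de Bruijn multiply, at the cost of a modulus operation.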
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)
Here is a 64-bit subtract and xor:
uint64_t v; // find the number of trailing zeros in 64-bit v (unsigned long is only 32 bits on some platforms)
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58]; // the product must stay 64 bits wide for the >> 58
For reference, here is a 64-bit version of the negate and mask method:
uint64_t v; // find the number of trailing zeros in 64-bit v (unsigned long is only 32 bits on some platforms)
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58]; // the product must stay 64 bits wide for the >> 58
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for an n-bit number [with log(n) extra space]. Typical solutions that check for the set bit either take O(n) time or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For a 64-bit number, log2(64), i.e. 6 masks are needed (log2 requires <math.h>)
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor << i; // factor * 2^i, without the floating-point pow()
++i;
}
return (trailZeros+1);
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = (x & 0x55555555) + ((x >> 1) & 0x55555555); // parentheses needed: + binds tighter than &
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
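The same steps can be packaged as a function. This is only a sketch with a made-up name; each masking step is fully parenthesised, since + binds tighter than & in C:

```c
#include <stdint.h>

/* Zero-based position of the lowest set bit; returns 32 when x is 0. */
int lowest_bit_pos_smear(uint32_t x)
{
    /* Smear the lowest set bit upwards through bit 31. */
    x |= x << 1;
    x |= x << 2;
    x |= x << 4;
    x |= x << 8;
    x |= x << 16;
    /* Invert: ones remain only below the original lowest set bit. */
    x = ~x;
    /* Count those ones with a parallel population count. */
    x = (x & 0x55555555u) + ((x >> 1)  & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2)  & 0x33333333u);
    x = (x & 0x0f0f0f0fu) + ((x >> 4)  & 0x0f0f0f0fu);
    x = (x & 0x00ff00ffu) + ((x >> 8)  & 0x00ff00ffu);
    x = (x & 0x0000ffffu) + ((x >> 16) & 0x0000ffffu);
    return (int)x;
}
```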
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned two's complement negation; negating a cast int32_t overflows for 0x80000000
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned two's complement negation; negating a cast int32_t overflows for 0x80000000
>> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what it is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
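if (value & 1073741824) return 31;
if (value & 2147483648u) return 32; // bit 31 was missing, so 0x80000000 fell through to "no bits set"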
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people who complain about how inefficient the worst-case scenario for this code is don't understand how rarely that condition will occur.
Recently I saw that Singapore's prime minister posted a program he wrote on Facebook, and one line of it is worth mentioning.
The logic is simply "value & -value". Suppose you have 0x0FF0; then
0FF0 & (F00F+1) = 0FF0 & F010, which equals 0x0010, meaning the lowest 1 is at bit position 4 (counting from 0). :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.
I have a fixed size boolean array of size 8. The default value of all elements in the array is false. There will be a number of truth values to fill between 1-8.
I want to distribute the truth values as far away from one another as possible. I also wish to be able to randomize the configuration. In this scenario the array wraps around so position 7 is "next to" position 0 in the array.
Here are some examples for fill values. I didn't include all possibilities, but hopefully it gets my point across.
1: [1, 0, 0, 0, 0, 0, 0, 0] or [0, 1, 0, 0, 0, 0, 0, 0]
2: [1, 0, 0, 0, 1, 0, 0, 0] or [0, 1, 0, 0, 0, 1, 0, 0]
3: [1, 0, 0, 1, 0, 0, 1, 0] or [0, 1, 0, 0, 1, 0, 0, 1]
4: [1, 0, 1, 0, 1, 0, 1, 0] or [0, 1, 0, 1, 0, 1, 0, 1]
5: [1, 1, 0, 1, 1, 0, 1, 0]
6: [1, 1, 0, 1, 1, 1, 0, 1]
7: [1, 1, 1, 1, 1, 1, 1, 0]
8: [1, 1, 1, 1, 1, 1, 1, 1]
The closest solution I have come up with so far hasn't quite produced the results I'm looking for...
I seek to write it in c++ but here is a little pseudo-code of my algorithm so far...
not quite working out how I wanted
truths = randBetween(1, 8)
values = [0,0,0,0,0,0,0,0]
startPosition = randBetween(0, 7) //starting index
distance = 4
for(i = 0; i < truths; i++) {
pos = i + startPosition + (i * distance)
values[pos % 8] = 1
}
This is an example output from my current code. Those marked with a star are incorrect.
[0, 0, 0, 0, 1, 0, 0, 0]
[0, 1, 0, 0, 1, 0, 0, 0]*
[0, 1, 0, 0, 1, 0, 1, 0]
[0, 1, 0, 1, 1, 0, 1, 0]*
[1, 1, 0, 1, 1, 0, 1, 0]
[1, 1, 0, 1, 1, 1, 1, 0]*
[1, 1, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 1, 1, 1]
I'm looking for a simple way to distribute the truth values evenly throughout the array without having to code for special cases.
Check this out:
#include <cassert>
#include <vector>
#include <iostream>
#include <iomanip>
/**
* Generate an even spaced pattern of ones
* @param arr destination vector of ints
* @param onescnt the requested number of ones
*/
static inline
void gen(std::vector<int>& arr, size_t onescnt) {
const size_t len = arr.size();
const size_t zeroscnt = len - onescnt;
size_t ones = 1;
size_t zeros = 1;
for (size_t i = 0; i < len; ++i) {
if (ones * zeroscnt < zeros * onescnt) {
ones++;
arr[i] = 1;
} else {
zeros++;
arr[i] = 0;
}
}
}
static inline
size_t count(const std::vector<int>& arr, int el) {
size_t cnt = 0;
for (size_t i = 0; i < arr.size(); ++i) {
cnt += arr[i] == el;
}
return cnt;
}
static inline
void gen_print(size_t len, size_t onescnt) {
std::vector<int> arr(len);
gen(arr, onescnt);
std::cout << "gen_printf(" << std::setw(2) << len << ", " << std::setw(2) << onescnt << ") = {";
for (size_t i = 0; i < len; ++i) {
std::cout << arr[i] << ",";
}
std::cout << "}\n";
assert(count(arr, 1) == onescnt);
}
int main() {
for (int i = 0; i <= 8; ++i) {
gen_print(8, i);
}
for (int i = 0; i <= 30; ++i) {
gen_print(30, i);
}
return 0;
}
Generates:
gen_printf( 8, 0) = {0,0,0,0,0,0,0,0,}
gen_printf( 8, 1) = {0,0,0,0,0,0,0,1,}
gen_printf( 8, 2) = {0,0,0,1,0,0,0,1,}
gen_printf( 8, 3) = {0,1,0,0,1,0,0,1,}
gen_printf( 8, 4) = {0,1,0,1,0,1,0,1,}
gen_printf( 8, 5) = {1,0,1,1,0,1,0,1,}
gen_printf( 8, 6) = {1,1,0,1,1,1,0,1,}
gen_printf( 8, 7) = {1,1,1,1,1,1,0,1,}
gen_printf( 8, 8) = {1,1,1,1,1,1,1,1,}
gen_printf(30, 0) = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
gen_printf(30, 1) = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,}
gen_printf(30, 2) = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,}
gen_printf(30, 3) = {0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,}
gen_printf(30, 4) = {0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,}
gen_printf(30, 5) = {0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,}
gen_printf(30, 6) = {0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,}
gen_printf(30, 7) = {0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,}
gen_printf(30, 8) = {0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,}
gen_printf(30, 9) = {0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,}
gen_printf(30, 10) = {0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,}
gen_printf(30, 11) = {0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,}
gen_printf(30, 12) = {0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,}
gen_printf(30, 13) = {0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,}
gen_printf(30, 14) = {0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,}
gen_printf(30, 15) = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,}
gen_printf(30, 16) = {1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,}
gen_printf(30, 17) = {1,0,1,0,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,1,}
gen_printf(30, 18) = {1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,}
gen_printf(30, 19) = {1,0,1,1,0,1,1,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,1,0,1,1,0,1,0,1,}
gen_printf(30, 20) = {1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,}
gen_printf(30, 21) = {1,1,0,1,1,0,1,1,0,1,1,1,0,1,1,0,1,1,0,1,1,1,0,1,1,0,1,1,0,1,}
gen_printf(30, 22) = {1,1,0,1,1,1,0,1,1,1,0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,0,1,}
gen_printf(30, 23) = {1,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1,0,1,}
gen_printf(30, 24) = {1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,}
gen_printf(30, 25) = {1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,}
gen_printf(30, 26) = {1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,}
gen_printf(30, 27) = {1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,}
gen_printf(30, 28) = {1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,}
gen_printf(30, 29) = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,}
gen_printf(30, 30) = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,}
Edit: better evenly spaced pattern.
Explanation:
So let's take an array of 8 ints in which we want 5 ones. The ideal ratio of (ones / zeros) in a sequence with 8 elements and 5 ones would be (5 / 3). We cannot hold that ratio exactly at every point of the array, but we can try to stay close to it.
The idea is to loop through the array and remember the number of ones and zeros we have written so far. If the ratio of (written ones / written zeros) is lower than the target ratio (ones / zeros) we want to achieve, we need to put a one into the sequence. Otherwise we put a zero into the sequence. The ratio changes and we make the decision again next time. The idea is to pursue the ideal ratio of ones to zeros in each slice of the array.
A simple way to do this would be to round the ideal fractional positions.
truths = randBetween(1, 8)
values = [0,0,0,0,0,0,0,0]
offset = randBetween(0, 8 * truths - 1)
for(i = 0; i < truths; i++) {
pos = (offset + (i * 8)) / truths
values[pos % 8] = 1
}
This is an application of Bresenham's line-drawing algorithm. I use it not because it's fast on old hardware, but because it places the true values exactly.
#include <iostream>
#include <stdexcept>
#include <string>
#include <random>
int main(int argc, char **argv) {
try {
// Read the argument.
if(argc != 2) throw std::invalid_argument("one argument");
int dy = std::stoi(argv[1]);
if(dy < 0 || dy > 8) throw std::out_of_range("[0..8]");
int values[8] = {0};
// https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm
int dx = 8;
int delta = 2 * dy - dx; // Balance the line. Permute it up later.
for(int x = 0; x < dx; x++) {
if(delta > 0) {
values[x] = 1;
delta -= 2 * dx;
}
delta += 2 * dy;
}
for(int x = 0; x < dx; x++)
std::cout << (x ? ", " : "") << values[x];
std::cout << std::endl;
// Rotate the number by a random amount.
// I'm sure there is an easier way to do this.
// https://stackoverflow.com/questions/7560114/random-number-c-in-some-range
std::random_device rd; // obtain a random number from hardware
std::mt19937 eng(rd()); // seed the generator
std::uniform_int_distribution<> distr(0, dx - 1);
int rotate = distr(eng);
bool first = true;
int x = rotate;
do {
std::cout << (first ? "" : ", ") << values[x];
first = false;
x = (x + 1) % dx;
} while(x != rotate);
std::cout << std::endl;
} catch(const std::exception &e) {
std::cerr << "Something went wrong: " << e.what() << std::endl;
return 1;
}
return 0;
}
Once you have an exact solution, rotate it by a random amount.
0, 1, 0, 0, 1, 0, 1, 0
1, 0, 0, 1, 0, 0, 1, 0
You need to calculate the distance dynamically. One element is clear: it can reside at an arbitrary location.
2 elements are clear, too: the distance needs to be 4.
4 elements need a distance of 2.
8 elements need a distance of 1.
More difficult are numbers that don't divide the array length:
3 requires a distance of 2.67.
5 requires a distance of 1.6.
7 requires a distance of 1.14.
Errm... In general, if you have a distance of X.Y, you will have to place some of the elements at distances of X and some at distances of X + 1. X is simple, it will be the result of an integer division: 8 / numberOfElements. The remainder will determine how often you will have to switch to X + 1: 8 % numberOfElements. For 3, this will result in 2, too, so you will have 1x distance of 2 and 2x distance of 3:
[ 1 0 1 0 0 1 0 0 ]
2 3 3 (distance to very first 1)
For 5, you'll get: 8/5 = 1, 8%5 = 3, so: 2x distance of 1, 3x distance of 2
[ 1 1 1 0 1 0 1 0 ]
1 1 2 2 2
For 7 you'll get: 8/7 = 1, 8%7 = 1, so: 6x distance of 1, 1x distance of 2
[ 1 1 1 1 1 1 1 0 ]
1 1 1 1 1 1 2
That will work for arbitrary array length L:
L/n = minimum distance
L%n = number of times to apply minimum distance + 1
n - L%n = number of times to apply minimum distance
Mathematical metrics won't reveal any difference between first applying all smaller distances then all larger ones, human sense for aesthetics, though, might prefer if you alternate between larger and smaller as often as possible – or you apply the algorithm recursively (for larger array length), to get something like 2x2, 3x3, 2x2, 3x3 instead of 4x2 and 6x3.
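To make the alternating variant concrete, here is a minimal sketch in C. The function name spread_ones and the running remainder err are my own choices, not something from the answer above; the remainder decides when a gap gets the longer distance, so the long and short gaps end up interleaved rather than grouped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: marks n of the len slots in out[] with 1, using
   gaps of len / n or len / n + 1 between consecutive ones. */
void spread_ones(int *out, size_t len, size_t n)
{
    assert(n <= len);
    for (size_t i = 0; i < len; i++)
        out[i] = 0;
    if (n == 0)
        return;

    size_t base  = len / n;   /* minimum distance                      */
    size_t extra = len % n;   /* how many gaps get the longer distance */
    size_t pos = 0, err = 0;
    for (size_t k = 0; k < n; k++) {
        out[pos] = 1;
        pos += base;
        err += extra;         /* accumulate the remainder              */
        if (err >= n) {       /* remainder overflowed: one longer gap  */
            err -= n;
            pos += 1;
        }
    }
}
```

For len = 8 and n = 3 this marks positions 0, 2 and 5, which is the [ 1 0 1 0 0 1 0 0 ] layout shown above.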
I'm trying to calculate the points in a cuboid given its centre (which is a Vector3) and the lengths of the sides along the x, y and z axis. I found the following on math.stackexchange.com: https://math.stackexchange.com/questions/107778/simplest-equation-for-drawing-a-cube-based-on-its-center-and-or-other-vertices which says I can use the following formulae:
The constructor for the World class is:
World::World(Vector3 o, float d1, float d2, float d3) : origin(o)
{
// If we consider an edge length to be d, we need to find r such that
// 2r = d in order to calculate the positions of each vertex in the world.
float r1 = d1 / 2,
r2 = d2 / 2,
r3 = d3 / 2;
for (int i = 0; i < 8; i++)
{
/* Sets up the vertices of the cube.
*
* @see http://bit.ly/1cc2RPG
*/
float x = o.getX() + (std::pow(-1, i&1) * r1),
y = o.getY() + (std::pow(-1, i&2) * r2),
z = o.getZ() + (std::pow(-1, i&4) * r3);
points[i] = Vector3(x, y, z);
std::cout << points[i] << "\n";
}
}
And I am passing the following parameters to the constructor:
Vector3 o(0, 0, 0);
World w(o, 100.f, 100.f, 100.f);
The coordinates being output for all 8 vertices are:
(50, 50, 50)
(-50, 50, 50)
(50, 50, 50)
(-50, 50, 50)
(50, 50, 50)
(-50, 50, 50)
(50, 50, 50)
(-50, 50, 50)
Which cannot be correct. Any guidance would be very much appreciated!
The problem lies in the bitwise & inside your pow calls:
For the y and z components, the masks yield either 0 or 2 and either 0 or 4, respectively, never 1. Since (-1)^0 = (-1)^2 = (-1)^4 = 1, the sign of these components is always positive. You could use (i&2)!=0 or (i&2) >> 1 for the y component instead, and likewise (i&4)!=0 or (i&4) >> 2 for the z component.
Change this:
float x = o.getX() + (std::pow(-1, i&1) * r1),
y = o.getY() + (std::pow(-1, i&2) * r2),
z = o.getZ() + (std::pow(-1, i&4) * r3);
To this:
float x = o.getX() + (std::pow(-1, (i ) & 1) * r1), // pow(-1, 0) == 1, pow(-1, 1) == -1
y = o.getY() + (std::pow(-1, (i >> 1) & 1) * r2), // pow(-1, 0) == 1, pow(-1, 1) == -1
z = o.getZ() + (std::pow(-1, (i >> 2) & 1) * r3); // pow(-1, 0) == 1, pow(-1, 1) == -1
Or even to this:
float x = o.getX() + (std::pow(-1, (i )) * r1), // pow(-1, {0, 2, 4, 6}) == 1, pow(-1, {1, 3, 5, 7}) == -1
y = o.getY() + (std::pow(-1, (i >> 1)) * r2), // pow(-1, {0, 2}) == 1, pow(-1, {1, 3}) == -1
z = o.getZ() + (std::pow(-1, (i >> 2)) * r3); // pow(-1, 0) == 1, pow(-1, 1) == -1
The problem is that, as written, the bits you mask out do identify whether or not each length needs to be negated, but they are not in the correct place value to get the desired behaviour from exponentiating -1.
Rewriting the code as above will solve the issue. However, it would be more readable, and in general more performant, to unroll the loop and write out by hand whether each component is an addition or a subtraction, without using the pow function.
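As a rough illustration of dropping pow entirely, here is a sketch in plain C that picks each sign straight from the bit test. Vec3 and cuboid_corners are hypothetical stand-ins for the question's Vector3 class and World constructor, so treat the names as assumptions.

```c
#include <assert.h>

/* Hypothetical stand-in for the question's Vector3 class. */
typedef struct { float x, y, z; } Vec3;

/* Fills out[0..7] with the corners of an axis-aligned cuboid centred at c
   with edge lengths d1, d2 and d3. Bit 0 of i selects the x sign, bit 1
   the y sign and bit 2 the z sign. */
void cuboid_corners(Vec3 c, float d1, float d2, float d3, Vec3 out[8])
{
    float r1 = d1 / 2, r2 = d2 / 2, r3 = d3 / 2;
    for (int i = 0; i < 8; i++) {
        out[i].x = c.x + ((i & 1) ? -r1 : r1);
        out[i].y = c.y + ((i & 2) ? -r2 : r2);
        out[i].z = c.z + ((i & 4) ? -r3 : r3);
    }
}
```

Because the mask result is only tested for zero or non-zero, it no longer matters that i&2 and i&4 produce 2 and 4 rather than 1.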
I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it is hard to predict which code will run faster on which architecture. Decreasing the operation count from 32 to 9, or things like that, might even decrease performance on some architectures. Code optimized for one architecture might be worse on another. I think you should either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
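A small sketch of those identities in C, assuming GCC or Clang (for __builtin_clz) and a 32-bit unsigned int. The function names are made up for illustration.

```c
#include <assert.h>

/* Clears the lowest set bit of x. */
unsigned clear_lowest_bit(unsigned x)
{
    return x & (x - 1);
}

/* Finds the lowest set bit using only a count-leading-zeroes primitive.
   Assumes a 32-bit unsigned int; x must be non-zero because
   __builtin_clz(0) is undefined. */
unsigned lowest_bit_via_clz(unsigned x)
{
    assert(x != 0);
    unsigned lowest = x & ~(x - 1);   /* isolate the lowest set bit */
    return 31u - (unsigned)__builtin_clz(lowest);
}
```

Once only one bit is left, the highest set bit and the lowest set bit are the same bit, which is why a leading-zero count is enough.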
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best case of one. Not only is this the fewest instructions, it is also the fewest branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value to find the lowest non-zero byte, then use that byte to look up the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)&value; // point at the bytes of value itself, not at the address it holds
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
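For anyone who would rather not do the exercise by hand, here is one possible way to fill the table at start-up, as a sketch only. It also swaps the byte pointer for shifts, which is my own variation so that the result does not depend on endianness; the names and the 32-bit unsigned assumption are mine.

```c
#include <assert.h>

static unsigned char lowest_bit_table[256];

/* Entry b holds the 0-based index of the lowest set bit of b.
   Entry 0 is never consulted for non-zero inputs and is left at 0. */
void init_lowest_bit_table(void)
{
    lowest_bit_table[0] = 0;
    for (unsigned b = 1; b < 256; b++) {
        unsigned pos = 0;
        while (!((b >> pos) & 1u))
            ++pos;
        lowest_bit_table[b] = (unsigned char)pos;
    }
}

/* Checks the bytes from least to most significant using shifts.
   Assumes a 32-bit unsigned and value != 0. */
unsigned lowest_bit_pos_table(unsigned value)
{
    assert(value != 0);
    if (value & 0xFFu)
        return lowest_bit_table[value & 0xFFu];
    if ((value >> 8) & 0xFFu)
        return lowest_bit_table[(value >> 8) & 0xFFu] + 8u;
    if ((value >> 16) & 0xFFu)
        return lowest_bit_table[(value >> 16) & 0xFFu] + 16u;
    return lowest_bit_table[(value >> 24) & 0xFFu] + 24u;
}
```

Call init_lowest_bit_table() once before the first lookup. A precomputed static initializer would avoid that call at the cost of 256 literal entries in the source.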
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds another opportunity to flush the instruction pipe and costs another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU stalls while the value is loaded into the cache from memory. Again, this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
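A thin wrapper over the two families of intrinsics might look like the sketch below. The wrapper name bit_scan_forward is hypothetical, and only the GCC/Clang branch is exercised here.

```c
#include <assert.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Returns the 0-based index of the lowest set bit; value must be non-zero,
   since both intrinsics leave the result undefined for 0. */
unsigned bit_scan_forward(unsigned value)
{
    assert(value != 0);
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, value);          /* Microsoft compilers */
    return (unsigned)index;
#else
    return (unsigned)__builtin_ctz(value);   /* GCC / Clang */
#endif
}
```

Note that __builtin_ffs is 1-based and returns 0 for an input of 0, whereas __builtin_ctz is 0-based and undefined for 0, so the two are not interchangeable without an adjustment.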
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
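If the endianness assumption bothers you, one option is to copy the double into a 64-bit integer with memcpy and shift the exponent out of that. This is only a sketch: it still assumes an IEEE 754 binary64 double, and that floating-point and integer values share the same byte order on the target. The function name is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the 0-based index of the lowest set bit, or -1023 for 0. */
int lowest_bit_pos_double(uint32_t value)
{
    /* For a lowest set bit at index p this is 2^(p+1) - 1, whose
       floating-point exponent is exactly p. */
    double d = (double)(value ^ (value - !!value));
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);     /* well-defined type pun */
    return (int)(bits >> 52) - 1023;    /* remove the exponent bias */
}
```

The value is never negative, so the sign bit is 0 and the shift leaves only the 11 exponent bits.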
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from first checking which group it's in, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
Another method (modulus division and lookup) deserves a special mention here, from the same link provided by @anton-tykhyy. This method is very similar in performance to the DeBruijn multiply and lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the DeBruijn multiply and lookup method returns zero on both inputs.
test:-
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]; /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]; /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63, whereas negate and mask returns 0.)
Here is a 64-bit subtract and xor:
uint64_t v; // find the number of trailing zeros in 64-bit v (unsigned long can be only 32 bits wide)
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89ULL)) >> 58]; // keep all 64 bits of the product; the index is its top 6 bits
For reference, here is a 64-bit version of the negate and mask method:
uint64_t v; // find the number of trailing zeros in 64-bit v (unsigned long can be only 32 bits wide)
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89ULL)) >> 58]; // keep all 64 bits of the product; the index is its top 6 bits
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for an n-bit number (with log(n) extra space). Typical solutions that check for the set bit either take O(n) time or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No. of trailing zeros in x = 1 * [(y & m0) = 0] + 2 * [(y & m1) = 0] + 4 * [(y & m2) = 0] + ..., where y = x & -x keeps only the lowest set bit of x
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For a 64-bit number, log2(64), i.e. 6 masks are needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
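Written without the deep nesting, the binary search could look roughly like this. It is a sketch for 32-bit values with a 0-based result; the function name and the -1 return for zero are my own choices.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the 0-based index of the lowest set bit, or -1 for 0. */
int lowest_bit_pos_bsearch(uint32_t value)
{
    if (value == 0)
        return -1;
    int pos = 0;
    /* Each step halves the window: if the lower half is empty, skip it. */
    if ((value & 0x0000FFFFu) == 0) { value >>= 16; pos += 16; }
    if ((value & 0x000000FFu) == 0) { value >>= 8;  pos += 8;  }
    if ((value & 0x0000000Fu) == 0) { value >>= 4;  pos += 4;  }
    if ((value & 0x00000003u) == 0) { value >>= 2;  pos += 2;  }
    if ((value & 0x00000001u) == 0) {               pos += 1;  }
    return pos;
}
```

That is five tests for any non-zero input, and a compiler is free to turn each of them into a conditional move instead of a branch.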
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation, which does away with casts and removes a VC++17 warning by explicitly truncating a 64-bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( 0u - value ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // 0u - value negates without signed overflow for 0x80000000
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( 0u - value ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // 0u - value negates without signed overflow for 0x80000000
>> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what it is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
#include <math.h>
unsigned GetLowestBitPos(unsigned n)
{
if (n == 0) return 0; // no bits set
return (unsigned)log2(n & -n) + 1; // assuming the bit index starts from 1
}
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
if (value & 2147483648u) return 32;
return 0; // no bits set
}
Assuming uniformly distributed inputs:
50% of all numbers will return on the first line of code.
75% of all numbers will return in the first 2 lines of code.
87.5% of all numbers will return in the first 3 lines of code.
About 94% of all numbers will return in the first 4 lines of code.
About 97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people who complain about how inefficient the worst case of this code is don't understand how rarely that case occurs.
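The percentages above can be checked by brute force. This is only a rough sketch: `count_early_exits` is a made-up helper name, and it assumes every nonzero input is equally likely, which real workloads may not satisfy.

```c
#include <assert.h>

/* Hypothetical helper: counts how many of the values 1..limit would return
   within the first `tests` checks of the unrolled chain, i.e. have at least
   one of their low `tests` bits set. `tests` must be between 1 and 31. */
unsigned long count_early_exits(unsigned long limit, unsigned tests)
{
    assert(tests >= 1 && tests <= 31);
    unsigned long mask = (1ul << tests) - 1;  /* the low `tests` bits */
    unsigned long hits = 0;
    for (unsigned long v = 1; v <= limit; ++v)
        if (v & mask)
            ++hits;
    return hits;
}
```

For the 65535 nonzero 16-bit values, `count_early_exits(65535, 1)` gives 32768 (about 50%) and `count_early_exits(65535, 5)` gives 63488 (about 97%).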
I recently saw that Singapore's prime minister posted a program he wrote on Facebook, and one line in it uses this trick.
The logic is simply "value & -value". Suppose you have 0x0FF0; then
0x0FF0 & (0xF00F + 1) = 0x0FF0 & 0xF010, which equals 0x0010, so the lowest set bit is bit 4 (counting from 0). :)
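As a small sketch of the "value & -value" step in C (the function name `isolate_lowest_bit` is my own, not from the program mentioned above), using an unsigned type so that the negation is well defined:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper name. Returns value with only its lowest set bit
   kept, or 0 when value is 0. */
uint32_t isolate_lowest_bit(uint32_t value)
{
    /* 0u - value is the two's-complement negation (~value + 1); for
       unsigned operands the wrap-around is defined behaviour. */
    return value & (0u - value);
}
```

With 0x0FF0 this yields 0x0010. Note that the result is the isolated bit, not its index, so it still has to be turned into a position (for example with a lookup or a count of trailing zeros).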
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: A table with one entry per 32-bit value would consume at least 4 GB with one-byte entries (16 GB if the entries stay unsigned, as declared above). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4 GB table is unrealistic.
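A scaled-down variant of the same idea, which is a different trade-off from the full table above, is to keep a 256-entry table and walk the value one byte at a time. The names `byte_table`, `init_byte_table` and `lowest_bit_pos_bytewise` are hypothetical, and this is only a sketch, not a tuned implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: byte_table[b] is the zero-based position of the
   lowest set bit in byte b. Entry 0 is never read. */
static uint8_t byte_table[256];

/* Fill the table once, using the plain shift loop from the question. */
void init_byte_table(void)
{
    for (unsigned i = 1; i < 256; ++i) {
        unsigned v = i, pos = 0;
        while (!(v & 1)) { v >>= 1; ++pos; }
        byte_table[i] = (uint8_t)pos;
    }
}

/* Zero-based position of the lowest set bit; value must be nonzero and
   init_byte_table() must have been called first. */
unsigned lowest_bit_pos_bytewise(uint32_t value)
{
    assert(value != 0);
    unsigned shift = 0;
    while ((value & 0xFFu) == 0) {  /* skip low bytes that are all zero */
        value >>= 8;
        shift += 8;
    }
    return shift + byte_table[value & 0xFFu];
}
```

After calling `init_byte_table()`, `lowest_bit_pos_bytewise(0x0FF0)` returns 4. The table costs 256 bytes instead of gigabytes, at the price of up to three extra shift-and-test steps.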