How does GCC optimize this `switch` - c++

Today I discovered that GCC does some amazing magic for optimizing switches in this code:
StairsType GetStairsType(uint8_t tileId, uint8_t dlvl)
{
if (dlvl == 0)
return StairsType::Part;
if (tileId == 48) {
return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
}
switch (tileId) {
case 57: return StairsType::Down;
// many more tile IDs go here
}
}
Have a look at this 🪄:
https://godbolt.org/z/snY3jv8Wz
(gcc is on the left, clang is on the right)
Somehow GCC manages to compile this to 1/4 of the code compared to Clang.
What does this asm do, conceptually?
How does GCC do this?

Both compilers compile this part in the obvious way:
if (dlvl == 0)
return StairsType::Part;
When it comes to this part:
if (tileId == 48) {
// This ID is used by both Caves and Crypt.
return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
}
gcc checks tileId==48 directly, while clang decides to merge it into the switch statement.
Both compilers decide to calculate dlvl >= 21 ? 1 : 4 branchlessly in more-or-less the same way, but with different exact instructions. GCC sneakily uses the carry flag to calculate ((dlvl >= 21 ? 0 : -1) & 3) + 1 while clang straightforwardly calculates ((dlvl < 21) * 3) + 1.
// GCC
cmp sil, 21
sbb eax, eax
and eax, 3
add eax, 1
// clang
xor eax, eax
cmp sil, 21
setb al
lea eax, [rax + 2*rax]
inc eax
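In C terms, both sequences compute the same value; a sketch (not actual compiler output):
// GCC: cmp sets the carry flag iff dlvl < 21, and sbb turns it into 0 or -1
unsigned gcc_way   = ((dlvl >= 21 ? 0u : ~0u) & 3) + 1; // cmp/sbb/and/add
// clang: materialize (dlvl < 21) as 0 or 1, then scale by 3 and add 1
unsigned clang_way = (unsigned)(dlvl < 21) * 3 + 1;     // cmp/setb/lea/inc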
When it comes to the switch statement, clang implements it with a jump table. The lowest entry is 16 and the highest is 160, so it subtracts 16 and then checks whether the number is greater than 144. There's a well-known trick to save one check, where numbers smaller than 16 wrap around to very big numbers, so they're greater than 144. For whatever reason, it chooses to preload the number 1 (StairsType::Down) into the return value register before doing the jump.
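Conceptually the trick is plain unsigned arithmetic; a sketch (assuming the default case returns StairsType::Invalid, which the question elides):
unsigned idx = (unsigned)tileId - 16; // tileId < 16 wraps around to a huge value
if (idx > 144)                        // covers both tileId < 16 and tileId > 160
    return StairsType::Invalid;       // the default case
// otherwise: indirect jump through jumpTable[idx]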
Meanwhile on gcc, it first checks if the tile ID is between 16 and 78. If so, it subtracts 16 and checks some bitmasks:
if (tileID < 16) {
    return 0;
} else if (tileID <= 78) {
    mask = (1ULL << (tileID - 16)); // 64-bit: the shift count goes up to 62
    if (2307251517974380578 & mask) return 2;
    if (4611688220747366401 & mask) return 1;
    return ((421920768 & mask) != 0) << 2;
} else {
    tileID += 125; // The same as tileID -= 131
    // because it's being treated as a byte
    if (tileID > 29) { // values lower than 131 wrapped around
        return 0;
    }
    // notice that if we get here it means the tileID is >= 131 and <= 160
    // << only looks at the bottom 5 bits of tileID (i.e. tileID & 31)
    return -((541065279 & (1 << tileID)) != 0) & 3;
}
Let's try that again, but with the bitmasks in binary and let's figure out what those return statements return:
if (tileID < 16) {
    return StairsType::Invalid;
} else if (tileID <= 78) {
    mask = (1ULL << (tileID - 16));
    // tile 17, 21, 35, 36, 38, 51, 56, 64, 66, 77
    if (0b0010000000000101000000010000100000000000010110000000000000100010 & mask) return StairsType::Up;
    // tile 16, 39, 40, 46, 47, 57, 78
    if (0b0100000000000000000000100000000011000100100000000000000000000001 & mask) return StairsType::Down;
    // tile 33, 34, 37, 40, 43, 44
    return ((0b00011001001001100000000000000000 & mask) != 0) ? StairsType::Part : StairsType::Invalid;
} else {
    if (tileID < 131 || tileID > 160) {
        return StairsType::Invalid;
    }
    // tile 131, 132, 133, 134, 135, 136, 153, 160
    return (0b00100000010000000000000000111111 & (1 << (tileID - 131))) ? StairsType::ShortcutToTown : StairsType::Invalid;
}
It seems the compiler noticed that you grouped your tile IDs into somewhat logical ranges, e.g. above 130 there are only shortcuts to town.
I have no idea how compiler writers come up with this stuff.

Related

How to write `a >>= std::countr_zero(a);` in C? [duplicate]

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
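A quick worked trace for the question's 0x0FF0 example:
// v            = 0x00000FF0
// v & -v       = 0x00000010   (isolate the lowest set bit)
// * 0x077CB531 = 0x77CB5310   (De Bruijn multiply, mod 2^32)
// >> 27        = 14           (the top five bits select the table slot)
// table[14]    = 4            (the expected bit position)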
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position is, e.g., 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
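For the question's 0-based contract, a minimal wrapper could look like this (a sketch; ffs() is 1-based, and its POSIX signature takes an int):
#include <strings.h>
#include <assert.h>

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0);                  // handled separately, as in the question
    return (unsigned)ffs((int)value) - 1; // ffs() counts from 1
}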
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture-dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it's hard to predict which code executes faster on which architecture. Decreasing the operation count from 32 to 9 or the like might even decrease performance on some architectures. Code optimized for one architecture might turn out worse on another. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
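For example, with the question's 0x0FF0:
unsigned x = 0x0FF0;
unsigned cleared = x & (x - 1);  // 0x0FE0: lowest set bit removed
unsigned lowest  = x & ~(x - 1); // 0x0010: lowest set bit isolated (same as x & -x)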
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that the order in which to check the bytes depends on whether the
// machine is big- or little-endian. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and use it to index a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best case of one. Not only is this the fewest instructions, but the fewest branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that the order in which to check the bytes depends on whether the
// machine is big- or little-endian. This is for little-endian
byte* bytes = (byte*)&value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and costs another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from memory. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction, and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
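A quick trace for 0x0FF0 shows how the pieces combine:
// value = 0x0FF0
// i16 = !(0x0FF0 & 0xffff) << 4 = 0   (low word is non-zero)
// i8  = !(0x0FF0 & 0xff)   << 3 = 0   (low byte is 0xF0)
// i4  = !(0x0FF0 & 0xf)    << 2 = 4;  value >>= 4  ->  0x00FF
// i2  = !(0x00FF & 0x3)    << 1 = 0
// i1  = !(0x00FF & 0x1)         = 0
// i0  = (value >> 0) & 1 ? 0 : -32 = 0
// result = 0 + 0 + 4 + 0 + 0 + 0 = 4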
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So, for example, there's nothing stopping you from checking which group it's in first, then checking each bit from smallest to biggest within that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4, which would require at worst 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x00000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
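For reference, the same idea as a compact loop (a sketch; 1-based like the code above, and it assumes value != 0):
unsigned GetLowestBitPos(unsigned value)
{
    unsigned pos = 1;
    for (unsigned width = 16; width > 0; width >>= 1) {
        unsigned mask = (1u << width) - 1;  // the low `width` bits
        if (!(value & mask)) {              // bit must be in the upper half
            value >>= width;
            pos += width;
        }
    }
    return pos;
}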
Another method (modulus division and lookup) deserves a special mention here, from the same link provided by @anton-tykhyy. This method is very similar in performance to the De Bruijn multiply-and-lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the De Bruijn multiply-and-lookup method returns zero for both inputs.
Test:
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int)((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int)((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(-n1 & n1) % 37]; /* returns 32 */
Mod37BitPosition[(-n2 & n2) % 37]; /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros of 0, this method returns 63 whereas negate and mask returns 0.)
Here is a 64-bit subtract and xor:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58];
Found this clever trick using 'magic masks' in "The Art of Computer Programming, Volume 4", which does it in O(log(n)) time for an n-bit number [with O(log(n)) extra space]. Typical solutions that check for the set bit are either O(n) or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No. of trailing zeros in x = 1 * [(y & m0) = 0] + 2 * [(y & m1) = 0] + 4 * [(y & m2) = 0] + ..., where y = x & -x isolates the lowest set bit.
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For a 64-bit number, log2(64) = 6 masks are needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * (1 << i);
++i;
}
return (trailZeros+1);
}
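As a sanity check, trace x = 0x2000, whose only set bit is bit 13:
// y = x & -x = 0x2000                   (bit 13)
// y & m[0] == 0  (13 is odd)            -> add 1
// y & m[1] != 0  (13 mod 4 == 1)        -> add 0
// y & m[2] == 0  (13 mod 8 == 5)        -> add 4
// y & m[3] == 0  (13 mod 16 == 13)      -> add 8
// y & m[4] != 0  (13 is in the low 16)  -> add 0
// y & m[5] != 0  (13 is in the low 32)  -> add 0
// trailZeros = 1 + 4 + 8 = 13, and the function returns 14 (1-based)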
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, possibly not the fastest, but it seems quite good.
At least it has no branches. ;)
uint32_t x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available to you, the compiler can sometimes do the task for you at compile time :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
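For example, the question's 0x0FF0 case can be checked at compile time:
static_assert(lssb(0x0FF0) == 5, "lowest set bit of 0x0FF0 is bit 4, i.e. 1-based index 5");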
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation, doing away with casts and removing a VC++17 warning by truncating the 64-bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if you're interested, here's a list of static asserts to check that the code does what it's intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people who complain about how inefficient the worst-case scenario of this code is don't understand how rarely that condition will happen.
Recently I saw that Singapore's prime minister posted a program he wrote on Facebook; there is one line worth mentioning.
The logic is simply "value & -value". Suppose you have 0x0FF0; then -value is ~value + 1 = 0xF00F + 1 = 0xF010, and 0x0FF0 & 0xF010 equals 0x0010, which means the lowest 1 is at bit 4. :)
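In code, that one line is simply (a sketch; this isolates the bit rather than returning its index):
unsigned lowest_set_bit(unsigned value)
{
    return value & -value; // e.g. 0x0FF0 -> 0x0010
}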
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we keep the entries as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

0FF0 & (F00F+1) , which equals 0x0010, that means the lowest 1 is in the 4th bit.. :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

Fastest way to get IPv4 address from string

I have the following code, which is about 7 times faster than inet_addr. I was wondering if there is a way to improve it further, or if a faster alternative exists.
This code requires that a valid, null-terminated IPv4 address with no whitespace is supplied, which in my case is always so, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or a faster alternative exists, I would really appreciate it.
UINT32 GetIP(const char *p)
{
UINT32 dwIP=0,dwIP_Part=0;
while(true)
{
if(p[0] == 0)
{
dwIP = (dwIP << 8) | dwIP_Part;
break;
}
if(p[0]=='.')
{
dwIP = (dwIP << 8) | dwIP_Part;
dwIP_Part = 0;
p++;
}
dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
p++;
}
return dwIP;
}
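As a quick sanity check, here is a usage sketch (the expected constant is taken from the test output shown in the answer below): the function packs the four octets into the 32-bit result with the first octet in the most significant byte.
#include <assert.h>
int main(void)
{
    assert(GetIP("192.167.1.3") == 0xC0A70103);
    return 0;
}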
Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.
Here is an x86-specific fast solution (needs SSE4.1, or at least SSSE3 for the poor):
__m128i shuffleTable[65536]; //can be reduced 256x times, see #IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
__m128i input = _mm_lddqu_si128((const __m128i*)str); //"192.167.1.3"
input = _mm_sub_epi8(input, _mm_set1_epi8('0')); //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
__m128i cmp = input; //...X...X.X.XX... (signs)
UINT32 mask = _mm_movemask_epi8(cmp); //6792 - magic index
__m128i shuf = shuffleTable[mask]; //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
__m128i arr = _mm_shuffle_epi8(input, shuf); //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
__m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
__m128i prod = _mm_maddubs_epi16(coeffs, arr); //3 0 | 1 0 | 67 100 | 92 100
prod = _mm_hadd_epi16(prod, prod); //3 | 1 | 167 | 192 | ? | ? | ? | ?
__m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
prod = _mm_shuffle_epi8(prod, imm); //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
return _mm_extract_epi32(prod, 0);
// return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}
And here is the required precalculation for shuffleTable:
void MyInit() {
memset(shuffleTable, -1, sizeof(shuffleTable));
int len[4];
for (len[0] = 1; len[0] <= 3; len[0]++)
for (len[1] = 1; len[1] <= 3; len[1]++)
for (len[2] = 1; len[2] <= 3; len[2]++)
for (len[3] = 1; len[3] <= 3; len[3]++) {
int slen = len[0] + len[1] + len[2] + len[3] + 4;
int rem = 16 - slen;
for (int rmask = 0; rmask < 1<<rem; rmask++) {
// { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
int mask = 0;
char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
int pos = 0;
for (int i = 0; i < 4; i++) {
for (int j = 0; j < len[i]; j++) {
shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
pos++;
}
mask ^= (1<<pos);
pos++;
}
mask ^= (rmask<<slen);
_mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
}
}
}
Full code with testing is available here. On an Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
This means that the suggested solution is 7.8 times faster in terms of throughput than the OP's code. It processes 336 million addresses per second (on a single core at 3.4 GHz).
Now I'll try to explain how it works. Note that on each line of the listing you can see the contents of the value just computed. All the arrays are printed in little-endian order (though the set intrinsics use big-endian).
First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this works automatically. On 32-bit it may theoretically cause an out-of-range access, though I do not believe it really can in practice. The subsequent code works properly regardless of the values in the after-the-end bytes. In any case, you'd better ensure that each IP address takes at least 16 bytes of storage.
Then we subtract '0' from all the chars. After that '.' turns into -2, and zero turns into -48, all the digits remain nonnegative. Now we take bitmask of signs of all the bytes with _mm_movemask_epi8.
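For clarity, here is a scalar sketch (my own illustration, not part of the vectorized code) of what that sign-bit mask computation does:
unsigned ScalarSignMask(const char *str) // illustrative name, not in the original
{
    // emulates _mm_sub_epi8 + _mm_movemask_epi8 one byte at a time:
    // after subtracting '0', a '.' or '\0' byte becomes negative,
    // and its sign bit is collected into the 16-bit mask
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if ((signed char)(str[i] - '0') < 0)
            mask |= 1u << i;
    return mask;
}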
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from the lookup table shuffleTable. The table is quite large: 1 MB total. And it takes quite some time to precompute. However, it does not take precious space in the CPU cache, because only 81 elements of the table are actually used: each part of an IP address can be one, two, or three digits long, hence 3^4 = 81 variants in total.
Note that random garbage bytes after the end of the string may in principle increase the memory footprint of the lookup table.
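One defensive option (a sketch, not from the original answer) is to copy each address into a zero-padded 16-byte buffer before parsing; that both guarantees the 16-byte load stays in bounds and keeps the trailing bytes deterministic:
#include <string.h>
UINT32 SafeGetIP(const char *str)
{
    char buf[16] = {0}; // zero padding; the longest "255.255.255.255" is 15 chars + NUL
    strncpy(buf, str, sizeof(buf) - 1);
    return MyGetIP(buf);
}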
EDIT: you can find a version modified by #IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 KB (it is a bit slower, though).
The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result XMM register contains four 4-byte blocks, each block contains digits in little-endian order. We convert each block into a 16-bit number by _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder bytes of the register, so that the whole IP address occupies the lower 4 bytes.
Finally, we extract the lower 4 bytes from the XMM register into a GP register. It is done with the SSE4.1 intrinsic _mm_extract_epi32. If you don't have it, replace that line with the commented-out one using _mm_extract_epi16, but it will run a bit slower.
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see #PeterCordes and #IwillnotexistIdonotexist
add rcx, rcx //can be removed, see #EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
P.S. If you are still reading it, be sure to check out comments =)
As for alternatives: this is similar to yours but with some error checking:
#include <iostream>
#include <string>
#include <cstdint>
uint32_t getip(const std::string &sip)
{
uint32_t r=0, b, p=0, c=0;
const char *s;
s = sip.c_str();
while (*s)
{
r<<=8;
b=0;
while (*s&&((*s==' ')||(*s=='\t'))) s++;
while (*s)
{
if ((*s==' ')||(*s=='\t')) { while (*s&&((*s==' ')||(*s=='\t'))) s++; if (*s!='.') break; }
if (*s=='.') { p++; s++; break; }
if ((*s>='0')&&(*s<='9'))
{
b*=10;
b+=(*s-'0');
s++;
}
}
if ((b>255)||(*s=='.')) return 0;
r+=b;
c++;
}
return ((c==4)&&(p==3))?r:0;
}
void testip(const std::string &sip)
{
uint32_t nIP=0;
nIP = getip(sip);
std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}
int main()
{
testip("192.167.1.3");
testip("292.167.1.3");
testip("192.267.1.3");
testip("192.167.1000.3");
testip("192.167.1.300");
testip("192.167.1.");
testip("192.167.1");
testip("192.167..1");
testip("192.167.1.3.");
testip("192.1 67.1.3.");
testip("192 . 167 . 1 . 3");
testip(" 192 . 167 . 1 . 3 ");
return 0;
}

Get next number from range in lexicographical order (without container of all stringifications)

How have/would you design a function that on each call returns the next value in a nominated numeric range, in lexicographical order of string representation...?
Example: range 8..203 --> 10, 100..109, 11, 110..119, 12, 120..129, 13, 130..139, ..., 19, 190..199, 20, 200..203, 21..29, 30..79, 8, 80..89, 9, 90..99.
Constraints: indices 0..~INT_MAX, fixed space, O(range-length) performance, preferably "lazy" so if you stop iterating mid way you haven't wasted processing effort. Please don't post brute force "solutions" iterating numerically while generating strings that are then sorted.
Utility: if you're generating data that ultimately needs to be lexicographically presented or processed, a lexicographical series promises lazy generation as needed, reduces memory requirements and eliminates a sort.
Background: when answering this question today, my solution gave output in numeric order (i.e. 8, 9, 10, 11, 12), not lexicographical order (10, 11, 12, 8, 9) as illustrated in the question. I imagined it would be easy to write or find a solution, but my Google-fu let me down and it was trickier than I expected, so I figured I'd collect/contribute here....
(Tagged C++ as it's my main language and I'm personally particularly interested in C++ solutions, but anything's welcome)
Somebody voted to close this because I either didn't demonstrate a minimal understanding of the problem being solved (hmmmm!?! ;-P), or an attempted solution. My solution is posted as an answer as I'm happy for it to be commented on and regaled in the brutal winds of Stack Overflow wisdom.... O_o
This is actually quite easy. First an observation:
Theorem: if two numbers x and y such that x < y are in the series and these numbers have the same number of digits, then x comes before y.
Proof: let's view the digits of x as xn..x0 and the digits of y as yn...y0. The two numbers agree on digits n down to some index i, and first differ at index i-1. Therefore, we have:
y = yn...yiy(i-1)...y0
x = yn...yix(i-1)...x0
since all digits from n to i are the same in both numbers. If x < y, then mathematically:
x(i-1) < y(i-1)
Lexicographically, if the digit x(i-1) is smaller than the digit y(i-1), then x comes before y.
This theorem means that in your specified range of [a, b] you have numbers with different numbers of digits, but the ones that have the same number of digits appear in their mathematical order. For example, among the three-digit numbers, 123 comes before 145 lexicographically precisely because 123 < 145.
Building on that, here's a simple algorithm. First, let's say a has m digits and b has n digits (n >= m)
1. create a heap with lexicographical order
2. initially, insert `a` and `10^i` for each i in [m, n - 1]
3. while the heap is not exhausted
3.1. remove and yield the top of the heap (`next`) as next result
3.2. if `next + 1` is still in range `[a, b]` (and doesn't increase in digits), insert it in heap
Notes:
In step 2, you are inserting the starting numbers of each series of numbers that have the same number of digits.
To change to a function that returns a number on each call, step 3.1 should be changed to store the state of the algorithm and resume on next call. Pretty standard.
Step 3.2 is the part that exploits the above theorem and keeps only the next number in mathematical order in the heap.
Assuming N = b - a, the extra space used by this algorithm is O(log N) and its time complexity is O(N * log log N).
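Here is a minimal C++ sketch of the algorithm above (my illustration; it uses std::priority_queue with string comparison for clarity rather than speed, and prints instead of yielding):
#include <iostream>
#include <queue>
#include <string>
#include <vector>
void lex_range(long long a, long long b)
{
    // min-heap in lexicographical order of the decimal representation
    auto cmp = [](long long x, long long y) {
        return std::to_string(x) > std::to_string(y);
    };
    std::priority_queue<long long, std::vector<long long>, decltype(cmp)> heap(cmp);
    heap.push(a); // step 2: seed with `a`...
    for (long long p = 10; p <= b; p *= 10)
        if (p > a)
            heap.push(p); // ...and the first number of each longer digit count
    while (!heap.empty()) {
        long long next = heap.top(); // step 3.1: yield the lexicographic minimum
        heap.pop();
        std::cout << next << '\n';
        // step 3.2: keep next + 1 only while it stays in range and digit count
        if (next + 1 <= b &&
            std::to_string(next + 1).size() == std::to_string(next).size())
            heap.push(next + 1);
    }
}
Calling lex_range(8, 203) reproduces the sequence from the question.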
Here's my attempt, in Python:
import math
#iterates through all numbers between start and end, that start with `cur`'s digits
def lex(start, end, cur=0):
if cur > end:
return
if cur >= start:
yield cur
for i in range(0,10):
#add 0-9 to the right of the current number
next_cur = cur * 10 + i
if next_cur == 0:
#we already yielded 0, no need to do it again
continue
for ret in lex(start, end, next_cur):
yield ret
print list(lex(8, 203))
Result:
[10, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 11, 110, 111, 112, 113,
114, 115, 116, 117, 118, 119, 12, 120, 121, 122, 123, 124, 125, 126, 127, 128,
129, 13, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 14, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 15, 150, 151, 152, 153, 154, 155, 156, 157,
158, 159, 16, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 17, 170, 171,
172, 173, 174, 175, 176, 177, 178, 179, 18, 180, 181, 182, 183, 184, 185, 186,
187, 188, 189, 19, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 20, 200,
201, 202, 203, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 8, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 9, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99]
This uses O(log(end)) stack space (one frame per digit), so for a typical 32-bit int it won't go any deeper than about ten calls. It runs in O(end) time, since it has to iterate through numbers smaller than start before it can begin yielding valid numbers. This can be considerably worse than O(end-start) if start and end are large and close together.
Iterating through lex(0, 1000000) takes about six seconds on my machine, so it appears to be slower than Tony's method but faster than Shahbaz's. Of course, it's challenging to make a direct comparison since I'm using a different language.
This is a bit of a mess, so I'm curious to see how other people tackle it. There are so many edge cases explicitly handled in the increment operator!
For range low to high:
0 is followed by 1
numbers shorter than high are always followed by 0-appended versions (e.g. 12->120)
numbers other than high that end in 0-8 are followed by the next integer
when low has as many digits as high, you finish after high (return sentinel high + 1)
otherwise you finish at a number 999... with one less digit than high
other numbers ending in 9(s) have the part before the trailing 9s incremented, but if that results in trailing 0s they're removed, provided the number stays above low
template <typename T>
std::string str(const T& t)
{
std::ostringstream oss; oss << t; return oss.str();
}
template <typename T>
class Lex_Counter
{
public:
typedef T value_type;
Lex_Counter(T from, T to, T first = -1)
: from_(from), to_(to),
min_size_(str(from).size()), max_size_(str(to).size()),
n_(first != -1 ? first : get_first()),
max_unit_(pow(10, max_size_ - 1)), min_unit_(pow(10, min_size_ - 1))
{ }
operator T() { return n_; }
T& operator++()
{
if (n_ == 0)
return n_ = 1;
if (n_ < max_unit_ && n_ * 10 <= to_)
return n_ = n_ * 10; // e.g. 10 -> 100, 89 -> 890
if (n_ % 10 < 9 && n_ + 1 <= to_)
return ++n_; // e.g. 108 -> 109
if (min_size_ == max_size_
? n_ == to_
: (n_ == max_unit_ - 1 && to_ < 10 * max_unit_ - 10 || // 99/989
n_ == to_ && to_ >= 10 * max_unit_ - 10)) // eg. 993
return n_ = to_ + 1;
// increment the right-most non-9 digit
// note: all-9s case handled above (n_ == max_unit_ - 1 etc.)
// e.g. 109 -> 11, 19 -> 2, 239999->24, 2999->3
// comments below explain 230099 -> 230100
// search from the right until we have exactly non-9 digit
for (int k = 100; ; k *= 10)
if (n_ % k != k - 1)
{
int l = k / 10; // n_ 230099, k 1000, l 100
int r = ((n_ / l) + 1) * l; // 230100
if (r > to_ && r / 10 < from_)
return n_ = from_; // e.g. from_ 8, r 20...
while (r / 10 >= from_ && r % 10 == 0)
r /= 10; // e.g. 230100 -> 2301
return n_ = r <= from_ ? from_ : r;
}
assert(false);
}
private:
T get_first() const
{
if (min_size_ == max_size_ ||
from_ / min_unit_ < 2 && from_ % min_unit_ == 0)
return from_;
// can "fall" from e.g. 321 to 1000
return min_unit_ * 10;
}
T pow(T n, int exp)
{ return exp == 0 ? 1 : exp == 1 ? n : 10 * pow(n, exp - 1); }
T from_, to_;
size_t min_size_, max_size_;
T n_;
T max_unit_, min_unit_;
};
Performance numbers
I can count from 0 to 1 billion in under a second on a standard Intel machine / single threaded, MS compiler at -O2.
The same machine / harness running my attempt at Shahbaz's solution - below - takes over 3.5 seconds to count to 100,000. Maybe the std::set isn't a good heap/heap-substitute, or there's a better way to use it? Any optimisation suggestions welcome.
template <typename T>
struct Shahbaz
{
std::set<std::string> s;
Shahbaz(T from, T to)
: to_(to)
{
s.insert(str(from));
for (int n = 10; n < to_; n *= 10)
if (n > from) s.insert(str(n));
n_ = atoi(s.begin()->c_str());
}
operator T() const { return n_; }
Shahbaz& operator++()
{
if (s.empty())
n_ = to_ + 1;
else
{
s.erase(s.begin());
if (n_ + 1 <= to_)
{
s.insert(str(n_ + 1));
n_ = atoi(s.begin()->c_str());
}
}
return *this;
}
private:
T n_, to_;
};
Perf code for reference...
void perf()
{
DWORD start = GetTickCount();
int to = 1000 *1000;
// Lex_Counter<int> counter(0, to);
Shahbaz<int> counter(0, to);
while (counter <= to)
++counter;
DWORD elapsed = GetTickCount() - start;
std::cout << '~' << elapsed << "ms\n";
}
Some Java code (deriving C++ code from this should be trivial), very similar to Kevin's Python solution:
public static void generateLexicographical(int lower, int upper)
{
for (int i = 1; i < 10; i++)
generateLexicographical(lower, upper, i);
}
private static void generateLexicographical(int lower, int upper, int current)
{
if (lower <= current && current <= upper)
System.out.println(current);
if (current > upper)
return;
for (int i = 0; i < 10; i++)
generateLexicographical(lower, upper, 10*current + i);
}
public static void main(String[] args)
{
generateLexicographical(11, 1001);
}
The order of the if-statements is not important, and one can be made an else of the other, but strangely enough, changing them in any way makes it take about 20% longer.
This just starts with each number from 1 to 9, then recursively appends each possible digit from 0 to 9 to that number, until we get a number bigger than the upper limit.
It similarly uses O(log upper) space (every digit requires a stack frame) and O(upper) time (we go from 1 to upper).
I/O is obviously the most time-consuming part here. If it is removed and replaced by just incrementing a variable, generateLexicographical(0, 100_000_000); takes about 4 seconds, though this is by no means a proper benchmark.

Position of least significant bit that is set

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position is e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
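A quick usage sketch with the question's example value (note that ffs is 1-based, so the lowest set bit of 0x0FF0, bit 4, is reported as 5):
#include <stdio.h>
#include <strings.h>
int main(void)
{
    printf("%d\n", ffs(0x0FF0)); /* prints 5 */
    return 0;
}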
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it's hard to predict which code executes faster on which architecture. Decreasing the operation count from 32 to 9 or the like might even decrease performance on some architectures. Optimized code on one architecture might result in worse code on another. I think you'd either optimize this for a specific CPU, or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and (x & ~(x-1)) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
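As a concrete sketch of that emulation (my example, assuming GCC's __builtin_clz as the available count-leading-zeroes primitive):
unsigned lowest_set_bit_via_clz(unsigned x) /* assumes x != 0 and 32-bit unsigned */
{
    unsigned lowest = x & ~(x - 1);      /* same as x & -x: isolate the lowest set bit */
    return 31 - __builtin_clz(lowest);   /* locate that single bit; 0-based result */
}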
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst case of four conditional tests and a best case of one. Not only is this the fewest instructions, but also the fewest branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value, find the lowest non-zero byte, and use that byte to look up the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)&value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
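The table itself is left as an exercise above; one straightforward way to fill it (a sketch, using 0-based positions to match the question) is:
void InitLowestBitTable(unsigned char table[256])
{
    table[0] = 0; /* never looked up: the caller always picks a non-zero byte */
    for (int i = 1; i < 256; i++) {
        unsigned char pos = 0;
        while (!(i & (1 << pos)))
            ++pos;
        table[i] = pos; /* 0-based position of the lowest set bit of i */
    }
}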
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worst group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and costs another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU stalls while the value is loaded from memory. Again, this varies from one machine to the next. The newer Intel chips actually use this as an opportunity to swap threads while the current thread waits for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times, it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware with no Cache Miss, no Branch Miss-Prediction and No other programmer generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
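A minimal portable wrapper over those intrinsics might look like this (a sketch; the zero case is left to the caller, since both intrinsics leave it undefined):
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
static int bit_scan_forward(uint32_t v) /* v must be non-zero; 0-based result */
{
    unsigned long index;
    _BitScanForward(&index, v);
    return (int)index;
}
#else
static int bit_scan_forward(uint32_t v) /* v must be non-zero; 0-based result */
{
    return __builtin_ctz(v);
}
#endif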
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(ui value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So, for example, there's nothing stopping you from checking which grouping it's in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal is to check in groups of 4, which requires at worst 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea, but that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
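Writing out the elided branches mechanically gets long, so here is a compact equivalent of the same binary search (a sketch; it keeps this answer's 1-based numbering and returns 0 for a zero input):
unsigned GetLowestBitPosBinary(unsigned value)
{
    if (value == 0)
        return 0;
    unsigned pos = 1;
    if (!(value & 0x0000FFFF)) { value >>= 16; pos += 16; }
    if (!(value & 0x000000FF)) { value >>= 8;  pos += 8;  }
    if (!(value & 0x0000000F)) { value >>= 4;  pos += 4;  }
    if (!(value & 0x00000003)) { value >>= 2;  pos += 2;  }
    if (!(value & 0x00000001)) { pos += 1; }
    return pos;
}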
Another method (modulus division and lookup) deserves a special mention here, from the same link provided by #anton-tykhyy. This method is very similar in performance to the DeBruijn multiply-and-lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v = 0x00000000 and v = 0xFFFFFFFF, whereas the DeBruijn multiply-and-lookup method returns zero for both inputs.
Test:
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas negate and mask returns 0.)
Here is a 64-bit subtract and xor:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89ULL)) >> 58];
For reference, here is a 64-bit version of the negate and mask method:
unsigned long v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89ULL)) >> 58];
I found this clever trick using 'magic masks' in "The Art of Computer Programming, Volume 4", which does it in O(log(n)) time for an n-bit number, with O(log(n)) extra space. Typical solutions that check for the set bit are either O(n) time or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea (applied to y = x & -x, which has exactly one bit set):
Number of trailing zeros in y = 1 * [(y & m0) == 0] + 2 * [(y & m1) == 0] + 4 * [(y & m2) == 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For a 64 bit number, log2(64) = 6 masks are needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, possibly not the fastest, but it seems quite good.
At least it has no branches. ;)
uint32_t x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available to you, the compiler can sometimes do the task for you at compile time :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to #Anton Tykhyy's answer.
Here is my C++11 constexpr implementation, doing away with casts and removing a warning on VC++17 by truncating the 64-bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF)
>> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logarithms is a bit costly:
#include <cmath>
unsigned GetLowestBitPos(unsigned n)
{
    if (n == 0)
        return 0;
    return log2(n & -n) + 1; // assuming the bit index starts from 1
}
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
if (value & 2147483648) return 32;
return 0; // no bits set
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for 97% of the test cases than most of the algorithms posted in this thread!
I think people complaining about how inefficient the worst-case scenario of this code is don't appreciate how rarely that condition will occur.
Recently I saw that Singapore's Prime Minister posted a program he wrote on Facebook; one line of it is worth mentioning.
The logic is simply "value & -value". Suppose you have 0x0FF0; then -value = ~value + 1 = 0xF00F + 1, and
0x0FF0 & (0xF00F + 1) equals 0x0010, which means the lowest 1 is in the 4th bit. :)
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: this table would consume at least 4 GB (16 GB if we keep the entry type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, though, a 4 GB table is unrealistic.