How can C++ represent state flags that depend on the order of operations?

I am working on a motion control system. To make the motion control interface more flexible, I want to provide an API that can specify the controlled axes.
It looks like this:
void moveTo(/*data*/, Axis which);
The Axis parameter defines how the motion data should be interpreted, and the internal implementation checks the data against the requested axes.
So I want to define operation rules on Axis parameters.
I tried bit flags, defined as follows:
enum class Axis : std::uint_fast16_t {
X = 1l << 0,
Y = 1l << 1,
Z = 1l << 2
};
// operator | and & and so on are also defined.
It can be used as follows:
const auto which = Axis::X | Axis::Y;
But what I want is for Axis::X | Axis::Y and Axis::Y | Axis::X to be different. Right now they have the same value, so there is no way to distinguish them inside the function.
How should I define a bit mask whose value depends on the order of operations?
I could list all allowed combined states in a single enumeration, but I would like to know whether there is a more elegant and clearer solution.
Or is there a better way to implement this functionality?
I would like the final usage to stay as simple as the bit mask. I don't mind doing complicated work behind the scenes, and I don't mind using a third-party library.

Given that you have just 3 axes and a limited number of combinations, I'd suggest simply defining them all.
enum class Axis : std::uint_fast16_t {
X = 0,
Y = 1,
Z = 2,
XX = 3,
YX = 4,
ZX = 5,
XY = 6,
YY = 7,
ZY = 8,
XZ = 9,
YZ = 10,
ZZ = 11,
XXX = 12,
YXX = 13,
ZXX = 14,
XYX = 15,
YYX = 16,
ZYX = 17,
XZX = 18,
YZX = 19,
ZZX = 20,
XXY = 21,
YXY = 22,
ZXY = 23,
XYY = 24,
YYY = 25,
ZYY = 26,
XZY = 27,
YZY = 28,
ZZY = 29,
XXZ = 30,
YXZ = 31,
ZXZ = 32,
XYZ = 33,
YYZ = 34,
ZYZ = 35,
XZZ = 36,
YZZ = 37,
ZZZ = 38,
};
If you want to do it programmatically, you can build the flag by shifting the current value left by 2 bits every time you append an axis, since each axis is represented by 2 bits.
#include <type_traits>
#include <array>
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint_fast16_t
#include <iostream>
#include <bitset>
enum class Axis : std::uint_fast16_t {
X = 0b01,
Y = 0b10,
Z = 0b11,
};
Axis append_flag(Axis current, Axis to_append) {
std::uint_fast16_t flag = static_cast<std::uint_fast16_t>(current);
flag = (flag << 2) | static_cast<std::uint_fast16_t>(to_append);
return static_cast<Axis>(flag);
}
template <class ... T>
Axis make_flag(T ... axes) {
std::array values = { axes... };
auto flag = values[0];
for (std::size_t i = 1; i < values.size(); ++i) {
flag = append_flag(flag, values[i]);
}
return flag;
}
int main(int argc, const char* argv[]) {
auto result = make_flag(Axis::X, Axis::Y, Axis::Z);
auto value = static_cast<std::size_t>(result);
std::cout << value << " (" << std::bitset<16>(value) << ")" << std::endl;
return 0;
}
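To keep the call site as simple as the original bit mask, the same 2-bit packing could be hidden behind operator|. The following is only a sketch under stated assumptions: the operator| overload and the decode_axes helper are hypothetical names that do not appear in the answer above, the chain is assumed to be written left to right (Axis::X | Axis::Y | Axis::Z) and never with a multi-axis value on the right-hand side, and at most 8 axes fit into the 16 bits guaranteed by std::uint_fast16_t.

```cpp
#include <cstdint>
#include <vector>

enum class Axis : std::uint_fast16_t {
    X = 0b01,
    Y = 0b10,
    Z = 0b11,
};

// Hypothetical overload: appends rhs to the sequence stored in lhs,
// so X | Y and Y | X produce different values.
// Assumes rhs is a single axis (left-to-right chaining only).
constexpr Axis operator|(Axis lhs, Axis rhs) {
    using U = std::uint_fast16_t;
    return static_cast<Axis>(
        static_cast<U>((static_cast<U>(lhs) << 2) | static_cast<U>(rhs)));
}

// Hypothetical helper: unpacks the 2-bit groups back into the order
// in which the axes were appended.
std::vector<Axis> decode_axes(Axis packed) {
    auto bits = static_cast<std::uint_fast16_t>(packed);
    std::vector<Axis> reversed;
    while (bits != 0) {
        reversed.push_back(static_cast<Axis>(bits & 0b11));  // lowest group = last axis
        bits >>= 2;
    }
    return std::vector<Axis>(reversed.rbegin(), reversed.rend());
}
```

Inside moveTo, decode_axes(which) would then yield the axes in the order the caller wrote them. Because no axis uses the value 0b00, an empty 2-bit group reliably marks the end of the sequence.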

How to create a view over a range that always picks the first element and filters the rest?

I have a collection of values:
auto v = std::vector{43, 1, 3, 2, 4, 6, 7, 8, 19, 101};
Over this collection of values I want to apply a view that follows these criteria:
First element should always be picked.
From the next elements, pick only even numbers until ...
... finding an element equal or greater than 6.
This is the view I tried:
auto v = std::vector{43, 1, 3, 2, 4, 6, 7, 8, 19, 101};
auto r = v |
std::views::take(1) |
std::views::filter([](const int x) { return !(x & 1); }) |
std::views::take_while([](const int x) { return x < 6; });
for (const auto &x : r)
std::cout << x << ' ';
But execution never even enters the print loop because the view is empty. My guess is that all the criteria are applied at once:
Pick first element (43).
Is odd number.
View ends.
What I was expecting:
Pick first element without checking anything.
From the rest of elements, filter only even numbers (2, 4, 6, 8).
From filtered elements, pick numbers until a number equal to or greater than 6 appears (2, 4).
43 2 4 is printed.
How can I build a view over my collection of values that behaves as I was expecting?
With range-v3, you can use views::concat to concatenate the first element of the range and the remaining filtered elements, for example:
auto v = std::vector{43, 1, 3, 2, 4, 6, 7, 8, 19, 101};
auto r = ranges::views::concat(
v | ranges::views::take(1),
v | ranges::views::drop(1)
| ranges::views::filter([](const int x) { return !(x & 1); })
| ranges::views::take_while([](const int x) { return x < 6; })
);
Edit: My first solution will not work, as correctly pointed out by 康桓瑋:
bool first = true;
auto r = v |
std::views::filter([first](const int x) { return first || !(x & 1); }) |
std::views::take_while([&first](const int x) { return std::exchange(first, false) || x < 6; });
It seems to work with two bool variables, one for filter and one for take_while, but I am not sure whether this is undefined behavior or not, e.g.:
bool firstWhile = true;
bool firstTake = true;
auto r = v |
std::views::filter([&firstWhile](const int x) { return std::exchange(firstWhile, false) || !(x & 1); }) |
std::views::take_while([&firstTake](const int x) { return std::exchange(firstTake, false) || x < 6; });
So here is a new suggestion that avoids those problems. It depends on zip, which is available as std::views::zip in C++23 or in range-v3 (the code assumes using namespace std::views for simplicity):
auto r = zip(iota(0), v) |
filter([](const auto& it) { return it.first == 0 || !(it.second & 1); }) |
take_while([](const auto& it) { return it.first == 0 || it.second < 6; }) |
transform([](const auto& it) { return it.second; });
Not sure I like the complexity of either this or the other solution given, but there you have it.

How to write `a >>= std::countr_zero(a);` in C? [duplicate]

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, and pipelining) that it is hard to predict which code executes faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Code optimized for one architecture might be worse on another. I think you should either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest-byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the least amount of instructions, but the least amount of branches which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)&value; // take the address of value, not its contents
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
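The table that is left as an exercise above could be generated at compile time. The following sketch assumes C++17 and uses hypothetical names (make_lowest_bit_table, kLowestBitTable, lowest_bit_pos). It also extracts the bytes with shifts rather than through a byte pointer, which is a deliberate change from the code above so that the result does not depend on endianness.

```cpp
#include <array>
#include <cstdint>

// Entry i holds the zero-based index of the lowest set bit of i.
// Entry 0 is never consulted for a non-zero input and is left as 0.
constexpr std::array<std::uint8_t, 256> make_lowest_bit_table() {
    std::array<std::uint8_t, 256> table{};
    for (unsigned i = 1; i < 256; ++i) {
        unsigned pos = 0;
        while (((i >> pos) & 1u) == 0) {
            ++pos;
        }
        table[i] = static_cast<std::uint8_t>(pos);
    }
    return table;
}

constexpr auto kLowestBitTable = make_lowest_bit_table();

// Zero-based position of the lowest set bit; value must be non-zero.
unsigned lowest_bit_pos(std::uint32_t value) {
    if (value & 0xFFu)     return kLowestBitTable[value & 0xFFu];
    if (value & 0xFF00u)   return kLowestBitTable[(value >> 8) & 0xFFu] + 8u;
    if (value & 0xFF0000u) return kLowestBitTable[(value >> 16) & 0xFFu] + 16u;
    return kLowestBitTable[(value >> 24) & 0xFFu] + 24u;
}
```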
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The amount of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded into the cache. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times it is likely to only occur once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except Xcode AFAIK, has compiler intrinsics for both the forward bit scan and the reverse bit scan. These compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction, and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
Sorry, I totally forgot to provide a solution. This is the code I use on the iPad, which has no assembly-level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from first checking which group it's in, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4, which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea, but that extra check in the best case is worth it for the worst-case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x0000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
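The elided branches follow the same pattern. As a sketch only, the same binary search can be written compactly by shifting the value down whenever the lower half is empty. The function name is hypothetical, and the result is 1-based to match the fragment above, with 0 returned when no bit is set (an added convention, since the fragment does not cover that case).

```cpp
#include <cstdint>

// 1-based position of the lowest set bit, or 0 if value == 0.
int lowest_bit_binary_search(std::uint32_t value) {
    if (value == 0) {
        return 0;
    }
    int pos = 1;
    // Each step halves the search window: if the low half is empty,
    // discard it and remember how many bits were skipped.
    if ((value & 0x0000FFFFu) == 0) { value >>= 16; pos += 16; }
    if ((value & 0x000000FFu) == 0) { value >>= 8;  pos += 8;  }
    if ((value & 0x0000000Fu) == 0) { value >>= 4;  pos += 4;  }
    if ((value & 0x00000003u) == 0) { value >>= 2;  pos += 2;  }
    if ((value & 0x00000001u) == 0) { pos += 1; }
    return pos;
}
```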
Another method (modulus division and lookup), from the same link provided by @anton-tykhyy, deserves a special mention here. This method is very similar in performance to the De Bruijn multiply and lookup method, with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the De Bruijn multiply and lookup method returns zero for both inputs.
Test:
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]; /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]; /* returns 0 */
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63, whereas negate and mask returns 0.)
Here is a 64-bit subtract and xor:
uint64_t v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58]; // keep all 64 bits before shifting by 58
For reference, here is a 64-bit version of the negate and mask method:
uint64_t v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58]; // keep all 64 bits before shifting by 58
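As a self-contained sketch, the subtract-and-xor variant can be wrapped in a function that works on std::uint64_t throughout, so that the product keeps its upper bits before the shift by 58. The function and constant names are hypothetical; the table and the multiplier are the ones from the subtract-and-xor listing above.

```cpp
#include <cstdint>

// Zero-based index of the lowest set bit of a 64-bit value.
// For v == 0 this returns 63, as noted above for subtract and xor.
int trailing_zeros_debruijn64(std::uint64_t v) {
    static constexpr int kIndex[64] = {
         0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63
    };
    constexpr std::uint64_t kDeBruijn = 0x03F79D71B4CB0A89ULL;
    // v ^ (v - 1) sets the lowest set bit and every bit below it.
    // The top 6 bits of the 64-bit product select the table entry.
    return kIndex[((v ^ (v - 1)) * kDeBruijn) >> 58];
}
```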
Found this clever trick using 'magic masks' in "The art of programming, part 4". It runs in O(log(n)) time for an n-bit number [with log(n) extra space]. Typical solutions that check for the set bit are either O(n) or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For a 64-bit number, log2(64) = 6 masks are needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
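The same key idea can be sketched with integer arithmetic only, avoiding the floating-point log2 and pow calls. This variant uses a hypothetical function name and, unlike the code above, returns the zero-based count of trailing zeros (and -1 for an input of 0).

```cpp
#include <cstdint>

// Number of trailing zeros in x, or -1 if x == 0.
int trailing_zeros_magic_masks(std::uint64_t x) {
    if (x == 0) {
        return -1;
    }
    constexpr std::uint64_t masks[6] = {
        0x5555555555555555ULL,  // bit positions whose index has bit 0 clear
        0x3333333333333333ULL,  // ... index bit 1 clear
        0x0F0F0F0F0F0F0F0FULL,  // ... index bit 2 clear
        0x00FF00FF00FF00FFULL,  // ... index bit 3 clear
        0x0000FFFF0000FFFFULL,  // ... index bit 4 clear
        0x00000000FFFFFFFFULL   // ... index bit 5 clear
    };
    // Isolate the lowest set bit.
    const std::uint64_t lowest = x & (~x + 1);
    int zeros = 0;
    for (int i = 0; i < 6; ++i) {
        // If the isolated bit is not covered by mask i, bit i of its index is 1.
        if ((lowest & masks[i]) == 0) {
            zeros += 1 << i;
        }
    }
    return zeros;
}
```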
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
// note: '+' binds tighter than '&', so each masked term needs its own parentheses
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned two's-complement negation; negating a signed INT32_MIN would overflow
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned two's-complement negation; negating a signed INT32_MIN would overflow
>> 27];
}
but if the compiler can't or won't preprocess the call it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what is intended to:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");
Here is one simple alternative, even though finding logs is a bit costly.
if(n == 0)
return 0;
return log2(n & -n)+1; //Assuming the bit index starts from 1
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
if (value & 1073741824) return 31;
if (value & 2147483648u) return 32;
return 0; // no bits set
}
For uniformly distributed inputs:
50% of all numbers will return on the first line of code.
75% of all numbers will return within the first 2 lines of code.
87.5% of all numbers will return within the first 3 lines of code.
93.75% of all numbers will return within the first 4 lines of code.
96.875% of all numbers will return within the first 5 lines of code.
etc.
Think about how the compiler will translate this into ASM!
This unrolled "loop" will be quicker for about 97% of the test cases than most of the algorithms posted in this thread!
I think people who complain about how inefficient the worst-case scenario of this code is don't realize how rarely that case occurs.
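The percentages above follow from a short calculation, sketched here under the assumption that the inputs are uniformly random (real workloads may look quite different). With $k$ counting the lines from 1:

```latex
P(\text{return on line } k) = 2^{-k}, \qquad
P(\text{return within the first } k \text{ lines}) = 1 - 2^{-k}

E[\text{lines executed}] \approx \sum_{k \ge 1} k \, 2^{-k} = 2
```

For $k = 3, 4, 5$ the cumulative probability is 0.875, 0.9375 and 0.96875, so on average only about two tests run per call under this assumption.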
Recently I saw that Singapore's prime minister posted a program he wrote on Facebook, and one line of it does exactly this.
The logic is simply "value & -value". Suppose you have 0x0FF0; then
0x0FF0 & (0xF00F + 1) equals 0x0010, which means the lowest 1 is at bit 4 (counting from 0). :)
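A minimal sketch of that one-liner, assuming 32-bit unsigned values; the function name `lowest_set_bit` is made up for illustration. It only isolates the bit, so turning the result into an index still needs one of the other techniques in this thread.

```cpp
#include <cstdint>

// Isolates the lowest set bit of v (returns 0 when v == 0).
// Unsigned negation is well defined: 0 - v equals ~v + 1 modulo 2^32,
// so every bit above the lowest set bit is inverted and cancels in the AND.
inline std::uint32_t lowest_set_bit(std::uint32_t v)
{
    return v & (0u - v);
}
```

For 0x0FF0 this gives 0x0010, matching the worked example above.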
If you have the resources, you can sacrifice memory in order to improve the speed:
static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ };
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
return bitPositions[value];
}
Note: This table would consume at least 4 GB with one-byte entries (16 GB if the entries stay unsigned, as declared above). This is an example of trading one limited resource (RAM) for another (execution speed).
If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

Is there a fast (O(1)) way to find the position of the set bit in a one-hot binary sequence? [duplicate]

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned GetLowestBitPos(unsigned value)
{
assert(value != 0); // handled separately
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
return pos;
}
Any ideas how to squeeze some cycles out of it?
(Note: this question is for people who enjoy such things, not for people to tell me that xyz optimization is evil.)
[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!
Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:
unsigned int v; // find the number of trailing zeros in 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
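The fragment above can be wrapped into a self-contained function roughly as follows. This is a sketch under the assumption of 32-bit inputs; the name `ctz_debruijn32` is invented here, while the table and multiplier are the ones quoted above.

```cpp
#include <cstdint>

// Returns the 0-based index of the lowest set bit of v.
// Note: v == 0 also yields 0, so callers must rule that case out first.
inline int ctz_debruijn32(std::uint32_t v)
{
    static const int kTable[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    const std::uint32_t isolated = v & (0u - v);       // keep only the lowest set bit
    const std::uint32_t hash = isolated * 0x077CB531u; // same as shifting the constant left by the bit index
    return kTable[hash >> 27];                         // the top 5 bits differ for each of the 32 indices
}
```

Because the isolated value is a power of two, the multiplication is just a left shift of the de Bruijn constant, and the top five bits of the product identify the shift amount.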
Helpful references:
"Using de Bruijn Sequences to Index a 1 in a Computer Word" - Explanation about why the above code works.
"Board Representation > Bitboards > BitScan" - Detailed analysis of this problem, with a particular focus on chess programming
Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)
ffs(3) - Linux man page
Name
ffs - find first bit set in a word
Synopsis
#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);
Description
The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position e.g. 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.
Return Value
These functions return the position of the first bit set, or 0 if no bits are set in i.
Conforming to
4.3BSD, POSIX.1-2001.
Notes
BSD systems have a prototype in <string.h>.
There is an x86 assembly instruction (bsf) that will do it. :)
More optimized?!
Side Note:
Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it is hard to predict which code will run faster on which architecture. Reducing the number of operations from 32 to 9, or something similar, might even decrease performance on some architectures. Code optimized for one architecture might turn out worse on another. I think you should either optimize this for a specific CPU or leave it as it is and let the compiler choose what it thinks is better.
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
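The identities above can be sketched as follows, assuming 32-bit words. The helper `clz32` is a portable stand-in written for this example; on real hardware it would be the count-leading-zeros instruction, and all function names here are hypothetical.

```cpp
#include <cstdint>

// Portable stand-in for a hardware count-leading-zeros instruction.
// Returns 32 when x == 0.
inline int clz32(std::uint32_t x)
{
    int n = 0;
    for (std::uint32_t probe = 0x80000000u; probe != 0 && (x & probe) == 0; probe >>= 1)
        ++n;
    return n;
}

// x & (x - 1) clears the lowest set bit.
inline std::uint32_t clear_lowest_set_bit(std::uint32_t x) { return x & (x - 1u); }

// x & ~(x - 1) keeps only the lowest set bit.
inline std::uint32_t isolate_lowest_set_bit(std::uint32_t x) { return x & ~(x - 1u); }

// 0-based index of the lowest set bit, derived from count-leading-zeros.
// Returns -1 when x == 0.
inline int ctz_via_clz(std::uint32_t x)
{
    return 31 - clz32(isolate_lowest_set_bit(x));
}
```

Once the lowest bit is isolated, the highest and lowest set bit coincide, which is why a leading-zero count is enough.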
Here is a benchmark comparing several solutions:
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest non-zero byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best case of one. Not only is this the smallest number of instructions, it is also the smallest number of branches, which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to look up the real index.
This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.
E.g.
typedef unsigned char byte;
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that the order in which to check the indices depends on whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)&value; // take the address of value, not value itself
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
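One way to generate the table and avoid the endianness caveat is sketched below. It extracts bytes with shifts instead of a pointer cast, so the byte order of the machine does not matter. The names `LowestBitTable` and `lowest_bit_pos_bytewise` are invented for this example, and a 32-bit input is assumed.

```cpp
#include <cstdint>

// 256-entry table: index[b] is the 0-based index of the lowest set bit of byte b.
// Entry 0 is never consulted for non-zero inputs and is left at 0.
struct LowestBitTable {
    std::uint8_t index[256];
    LowestBitTable()
    {
        index[0] = 0;
        for (int b = 1; b < 256; ++b) {
            int pos = 0;
            while (((b >> pos) & 1) == 0)
                ++pos;
            index[b] = static_cast<std::uint8_t>(pos);
        }
    }
};

// Returns the 0-based index of the lowest set bit; value must be non-zero.
inline unsigned lowest_bit_pos_bytewise(std::uint32_t value)
{
    static const LowestBitTable table; // built once, on the first call
    if (value & 0xFFu)         return table.index[value & 0xFFu];
    if ((value >> 8) & 0xFFu)  return table.index[(value >> 8) & 0xFFu] + 8u;
    if ((value >> 16) & 0xFFu) return table.index[(value >> 16) & 0xFFu] + 16u;
    return table.index[value >> 24] + 24u;
}
```

The shifts cost a little compared with direct byte loads, which is the price of portability in this sketch.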
Anytime you have a branch, the CPU has to guess which branch will be taken. The instruction pipe is loaded with the instructions that lead down the guessed path. If the CPU has guessed wrong then the instruction pipe gets flushed, and the other branch must be loaded.
Consider the simple while loop at the top. The guess will be to stay within the loop. It will be wrong at least once when it leaves the loop. This WILL flush the instruction pipe. This behavior is slightly better than guessing that it will leave the loop, in which case it would flush the instruction pipe on every iteration.
The number of CPU cycles that are lost varies highly from one type of processor to the next. But you can expect between 20 and 150 lost CPU cycles.
The next worse group is where you think you're going to save a few iterations by splitting the value into smaller pieces and adding several more branches. Each of these branches adds an additional opportunity to flush the instruction pipe and cost another 20 to 150 clock cycles.
Let's consider what happens when you look up a value in a table. Chances are the value is not currently in cache, at least not the first time your function is called. This means that the CPU gets stalled while the value is loaded from memory. Again this varies from one machine to the next. The new Intel chips actually use this as an opportunity to swap threads while the current thread is waiting for the cache load to complete. This could easily be more expensive than an instruction pipe flush; however, if you are performing this operation a number of times it is likely to occur only once.
Clearly the fastest constant time solution is one which involves deterministic math. A pure and elegant solution.
My apologies if this was already covered.
Every compiler I use, except XCODE AFAIK, has compiler intrinsics for both the forward bitscan and the reverse bitscan. These will compile to a single assembly instruction on most hardware, with no cache miss, no branch misprediction and no other programmer-generated stumbling blocks.
For Microsoft compilers use _BitScanForward & _BitScanReverse.
For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.
Additionally, please refrain from posting an answer and potentially misleading newcomers if you are not adequately knowledgeable about the subject being discussed.
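A thin wrapper over the intrinsics named above might look like the sketch below. The wrapper name `bit_scan_forward` and the plain-loop fallback for other compilers are assumptions made for this example; a 32-bit input is assumed as well.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Returns the 0-based index of the lowest set bit, or -1 when value == 0.
inline int bit_scan_forward(std::uint32_t value)
{
    if (value == 0)
        return -1; // the intrinsics below give no usable result for 0
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, value);
    return static_cast<int>(index);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_ctz(value);
#else
    int index = 0; // portable fallback: plain shift loop
    while ((value & 1u) == 0) {
        value >>= 1;
        ++index;
    }
    return index;
#endif
}
```

The explicit zero check costs a branch, but it keeps the behaviour identical across the three code paths.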
Sorry I totally forgot to provide a solution.. This is the code I use on the IPAD which has no assembly level instruction for the task:
unsigned BitScanLow_BranchFree(unsigned value)
{
bool bwl = (value & 0x0000ffff) == 0;
unsigned I1 = (bwl * 15);
value = (value >> I1) & 0x0000ffff;
bool bbl = (value & 0x00ff00ff) == 0;
unsigned I2 = (bbl * 7);
value = (value >> I2) & 0x00ff00ff;
bool bnl = (value & 0x0f0f0f0f) == 0;
unsigned I3 = (bnl * 3);
value = (value >> I3) & 0x0f0f0f0f;
bool bsl = (value & 0x33333333) == 0;
unsigned I4 = (bsl * 1);
value = (value >> I4) & 0x33333333;
unsigned result = value + I1 + I2 + I3 + I4 - 1;
return result;
}
The thing to understand here is that it is not the compare that is expensive, but the branch that occurs after the compare. The comparison in this case is forced to a value of 0 or 1 with the .. == 0, and the result is used to combine the math that would have occurred on either side of the branch.
Edit:
The code above is totally broken. This code works and is still branch-free (if optimized):
int BitScanLow_BranchFree(unsigned value)
{
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
return i16 + i8 + i4 + i2 + i1 + i0;
}
This returns -1 if given 0. If you don't care about 0 or are happy to get 31 for 0, remove the i0 calculation, saving a chunk of time.
Inspired by this similar post that involves searching for a set bit, I offer the following:
unsigned GetLowestBitPos(unsigned value)
{
double d = value ^ (value - !!value);
return (((int*)&d)[1]>>20)-1023;
}
Pros:
no loops
no branching
runs in constant time
handles value=0 by returning an otherwise-out-of-bounds result
only two lines of code
Cons:
assumes little endianness as coded (can be fixed by changing the constants)
assumes that double is a real*8 IEEE float (IEEE 754)
Update:
As pointed out in the comments, a union is a cleaner implementation (for C, at least) and would look like:
unsigned GetLowestBitPos(unsigned value)
{
union {
int i[2];
double d;
} temp = { .d = value ^ (value - !!value) };
return (temp.i[1] >> 20) - 1023;
}
This assumes 32-bit ints with little-endian storage for everything (think x86 processors).
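In C++ the pointer cast and the union are both formally problematic type puns, so a `memcpy` variant is sketched here. It assumes that `double` is IEEE 754 binary64 and that doubles and 64-bit integers share the same byte order; the function name is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "this sketch assumes a 64-bit double");

// Returns the 0-based index of the lowest set bit, or -1023 when value == 0.
inline int lowest_bit_pos_double(std::uint32_t value)
{
    // value ^ (value - 1) sets the lowest set bit and every bit below it,
    // so the result lies in [2^k, 2^(k+1)) and its binary exponent is k.
    const double d = static_cast<double>(value ^ (value - (value != 0 ? 1u : 0u)));
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);        // well-defined way to read the representation
    return static_cast<int>(bits >> 52) - 1023; // biased exponent minus the bias (sign bit is 0)
}
```

Reading the whole 64-bit pattern and shifting by 52 removes the dependence on which 32-bit half holds the exponent.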
After 11 years we finally have countr_zero!
#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>
int main()
{
for (const std::uint8_t i : { 0, 0b11111111, 0b00011100, 0b00011101 }) {
std::cout << "countr_zero( " << std::bitset<8>(i) << " ) = "
<< std::countr_zero(i) << '\n';
}
}
Well done C++20
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4, which would require 11 operations in the worst case instead of your 32.
The best case goes from 1 check in your algorithm to 2 checks with this grouping idea. But that extra check in the best case is worth it for the worst-case savings.
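The counts quoted above fit a simple formula, sketched here under the assumption that the group size $g$ divides the word width $N$ and that the last bit of a group needs no test of its own:

```latex
W(g) = \frac{N}{g} + (g - 1), \qquad
W(1) = 32, \quad W(2) = 17, \quad W(4) = 11 \quad (N = 32)
```

The first term counts the group tests and the second the tests inside the matching group.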
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):
if (0x0000FFFF & value) {
if (0x000000FF & value) {
if (0x0000000F & value) {
if (0x00000003 & value) {
if (0x00000001 & value) {
return 1;
} else {
return 2;
}
} else {
if (0x00000004 & value) {
return 3;
} else {
return 4;
}
}
} else { ...
} else { ...
} else { ...
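The elided branches follow the same pattern at every level. A compact sketch of the same halving idea, written with shifts instead of fully nested branches, is shown below; it assumes a 32-bit input and keeps the 1-based result of the fragment above. The function name is invented for this example.

```cpp
#include <cstdint>

// Returns the 1-based position of the lowest set bit, or 0 when value == 0.
inline int lowest_bit_pos_binary_search(std::uint32_t value)
{
    if (value == 0)
        return 0;
    int pos = 1;
    // Each step halves the search window: if the low half is empty,
    // discard it and remember how many bits were skipped.
    if ((value & 0x0000FFFFu) == 0) { value >>= 16; pos += 16; }
    if ((value & 0x000000FFu) == 0) { value >>= 8;  pos += 8; }
    if ((value & 0x0000000Fu) == 0) { value >>= 4;  pos += 4; }
    if ((value & 0x00000003u) == 0) { value >>= 2;  pos += 2; }
    if ((value & 0x00000001u) == 0) { pos += 1; }
    return pos;
}
```

Five tests always suffice for a non-zero 32-bit value, which matches the "5 operations" claim.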
Another method (modulus division and lookup) deserves a special mention here, from the same link provided by @anton-tykhyy. This method is very similar in performance to the DeBruijn multiply and lookup method, with a slight but important difference.
Modulus division and lookup:
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
The modulus division and lookup method returns different values for v=0x00000000 and v=0xFFFFFFFF, whereas the DeBruijn multiply and lookup method returns zero for both inputs.
Test:
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int)((n1 & -n1) * 0x077CB531U)) >> 27]; /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int)((n2 & -n2) * 0x077CB531U)) >> 27]; /* returns 0 */
Mod37BitPosition[(-n1 & n1) % 37]; /* returns 32 */
Mod37BitPosition[(-n2 & n2) % 37]; /* returns 0 */
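As a self-contained sketch (32-bit input assumed, function name invented here), the modulus method with the table quoted above could be packaged like this:

```cpp
#include <cstdint>

// Returns the 0-based index of the lowest set bit, or 32 when v == 0.
inline int ctz_mod37(std::uint32_t v)
{
    static const int kMod37BitPosition[37] = {
        32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
        7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
        20, 8, 19, 18
    };
    // The 32 powers of two leave distinct non-zero remainders mod 37,
    // so remainder 0 is free to signal v == 0.
    return kMod37BitPosition[(v & (0u - v)) % 37u];
}
```

The distinct result for zero is the "slight but important difference" mentioned above; the cost is an integer division, which may be slow on some CPUs.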
According to the Chess Programming BitScan page and my own measurements, subtract and xor is faster than negate and mask.
(Note that if you are going to count the trailing zeros in 0, the method as I have it returns 63 whereas the negate and mask returns 0.)
Here is a 64-bit subtract and xor:
uint64_t v; // find the number of trailing zeros in 64-bit v (unsigned long is only 32 bits on some platforms)
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89ULL)) >> 58]; // keep all 64 bits before shifting
For reference, here is a 64-bit version of the negate and mask method:
uint64_t v; // find the number of trailing zeros in 64-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[64] =
{
0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
};
r = MultiplyDeBruijnBitPosition[((uint64_t)((v & -v) * 0x03F79D71B4CB0A89ULL)) >> 58]; // keep all 64 bits before shifting
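A runnable sketch of the 64-bit negate and mask variant follows, using the table and constant quoted above. The function name is made up. The product has to stay 64 bits wide before the shift by 58, otherwise the index is lost.

```cpp
#include <cstdint>

// Returns the 0-based index of the lowest set bit of a 64-bit value.
// v == 0 yields 0, so the caller must handle that case separately.
inline int ctz_debruijn64(std::uint64_t v)
{
    static const int kTable[64] = {
        0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4,
        62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5,
        63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11,
        46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6
    };
    const std::uint64_t isolated = v & (~v + 1u);                      // lowest set bit only
    const std::uint64_t hash = isolated * 0x03F79D71B4CB0A89ULL;       // wraps modulo 2^64
    return kTable[hash >> 58];                                         // top 6 bits select the entry
}
```

With a 64-bit product the top six bits are a unique index for each of the 64 bit positions.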
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for an n-bit number [with log(n) extra space]. Typical solutions that check for the set bit either take O(n) time or need O(n) extra space for a lookup table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea:
Number of trailing zeros in x = 1 * [(y & m0) == 0] + 2 * [(y & m1) == 0] + 4 * [(y & m2) == 0] + ..., where y = x & -x is the lowest set bit of x.
int lastSetBitPos(const uint64_t x) {
    if (x == 0) return -1;
    // For a 64-bit number, log2(64) = 6 masks are needed
    const int steps = 6;
    // magic masks
    static const uint64_t m[] = { 0x5555555555555555, // ...01010101
                                  0x3333333333333333, // ...00110011
                                  0x0f0f0f0f0f0f0f0f, // ...00001111
                                  0x00ff00ff00ff00ff, // ...0000000011111111
                                  0x0000ffff0000ffff,
                                  0x00000000ffffffff };
    // First extract only the lowest set bit
    uint64_t y = x & -x;
    int trailZeros = 0;
    for (int i = 0; i < steps; ++i) {
        // mask i misses y exactly when bit i of the bit index is set
        if ((y & m[i]) == 0)
            trailZeros += 1 << i;
    }
    return trailZeros + 1; // 1-based position
}
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set.
If so, check if any of the first 8 are set.
if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = (x & 0x55555555) + ((x >> 1) & 0x55555555); // parentheses needed: + binds tighter than &
// 0x00000000 0x0000002a 0x00001aaa
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
// 0x00000000 0x00000024 0x00001444
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
// 0x00000000 0x00000006 0x00000508
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
// 0x00000000 0x00000006 0x0000000d
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
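Packaged as a function, the smear-and-count idea could look like the sketch below. It assumes 32-bit inputs and uses fully parenthesised population-count steps, because `+` binds tighter than `&` in C and C++. The function name is hypothetical.

```cpp
#include <cstdint>

// Returns the 0-based index of the lowest set bit, or 32 when x == 0.
inline unsigned ctz_smear_popcount(std::uint32_t x)
{
    // Smear the lowest set bit upwards so every bit from it to bit 31 is 1.
    x |= x << 1;
    x |= x << 2;
    x |= x << 4;
    x |= x << 8;
    x |= x << 16;
    x = ~x; // ones remain only below the original lowest set bit

    // Count those ones with a parallel population count.
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
    return x;
}
```

The three sample values from the comments above give 0, 6 and 13.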
If C++11 is available for you, a compiler sometimes can do the task for you :)
constexpr std::uint64_t lssb(const std::uint64_t value)
{
return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1);
}
Result is 1-based index.
This is in regard to @Anton Tykhyy's answer.
Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:
constexpr uint32_t DeBruijnSequence[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
constexpr uint32_t ffs ( uint32_t value )
{
return DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned negation; negating a signed cast overflows for 0x80000000
>> 27];
}
To get around the issue of 0x1 and 0x0 both returning 0 you can do:
constexpr uint32_t ffs ( uint32_t value )
{
return (!value) ? 32 : DeBruijnSequence[
(( ( value & ( ~value + 1u ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) // ~value + 1u is the unsigned negation; negating a signed cast overflows for 0x80000000
>> 27];
}
but if the compiler can't or won't evaluate the call at compile time, it will add a couple of cycles to the calculation.
Finally, if interested, here's a list of static asserts to check that the code does what it is intended to do:
static_assert (ffs(0x1) == 0, "Find First Bit Set Failure.");
static_assert (ffs(0x2) == 1, "Find First Bit Set Failure.");
static_assert (ffs(0x4) == 2, "Find First Bit Set Failure.");
static_assert (ffs(0x8) == 3, "Find First Bit Set Failure.");
static_assert (ffs(0x10) == 4, "Find First Bit Set Failure.");
static_assert (ffs(0x20) == 5, "Find First Bit Set Failure.");
static_assert (ffs(0x40) == 6, "Find First Bit Set Failure.");
static_assert (ffs(0x80) == 7, "Find First Bit Set Failure.");
static_assert (ffs(0x100) == 8, "Find First Bit Set Failure.");
static_assert (ffs(0x200) == 9, "Find First Bit Set Failure.");
static_assert (ffs(0x400) == 10, "Find First Bit Set Failure.");
static_assert (ffs(0x800) == 11, "Find First Bit Set Failure.");
static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure.");
static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure.");
static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure.");
static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure.");
static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure.");
static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure.");
static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure.");
static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure.");
static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure.");
static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure.");
static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure.");
static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure.");
static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure.");
static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure.");
static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure.");
static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure.");
static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure.");
static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure.");
static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure.");
static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");

How does the Catch2 GENERATE macro work internally?

Recently I learned about the GENERATE macro in Catch2 (from this video). And now I am curious about how it works internally.
Naively one would think that for a test case with k generators (by a generator I mean one GENERATE call site), Catch2 just runs each test case n1 * n2 * ... * nk times, where ni is the number of elements in the i-th generator, each time specifying a different combination of values from those k generators. Indeed, this naive specification seems to hold for a simple test case:
TEST_CASE("Naive") {
auto x = GENERATE(0, 1);
auto y = GENERATE(2, 3);
std::cout << "x = " << x << ", y = " << y << std::endl;
}
As expected, the output is:
x = 0, y = 2
x = 0, y = 3
x = 1, y = 2
x = 1, y = 3
which indicates the test case runs 2 * 2 == 4 times.
However, it seems that Catch2 isn't implementing it naively, as shown by the following case:
TEST_CASE("Depends on if") {
auto choice = GENERATE(0, 1);
int x = -1, y = -1;
if (choice == 0) {
x = GENERATE(2, 3);
} else {
y = GENERATE(4, 5);
}
std::cout << "choice = " << choice << ", x = " << x << ", y = " << y << std::endl;
}
In the above case, the actual invocation (not callsite) of GENERATE depends on choice. If the logic were implemented naively, one would expect there to be 8 lines of output (since 2 * 2 * 2 == 8):
choice = 0, x = 2, y = -1
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
choice = 1, x = -1, y = 5
Notice the duplicate lines: the naive permutation still permutes the value of a generator even if it is not actually invoked. For example, y = GENERATE(4, 5) is only invoked if choice == 1, however, even when choice != 1, the implementation still permutes the values 4 and 5, even if those are not used.
The actual output, though, is:
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
No duplicate lines. This leads me to suspect that Catch2 internally uses a stack to track the generators invoked and the order of their latest invocation. Each time a test case finishes one iteration, it traverses the invoked generators in reverse order and advances each generator's value. If the advancement fails (i.e. the generator's sequence of values is exhausted), that generator is reset to its initial state (ready to emit the first value in the sequence) and the traversal moves on to the next generator; otherwise (the advancement succeeded), the traversal bails out.
In pseudocode it would look like:
for each generator that is invoked in reverse order of latest invocation:
bool success = generator.moveNext();
if success: break;
generator.reset();
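The guess above can be modeled in a few lines of C++. This is only a sketch of the hypothesized behavior, not Catch2's actual implementation; Tracker, GenState, the integer call-site ids, and runDependsOnIf are all hypothetical names made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one generator: its values and the current position.
struct GenState {
    std::vector<int> values;
    std::size_t index;
};

class Tracker {
public:
    // Stand-in for GENERATE at call site `site`: returns the current value
    // and records the site as the most recently invoked generator.
    int generate(int site, std::vector<int> values) {
        assert(!values.empty());
        auto it = gens_.find(site);
        if (it == gens_.end()) {
            it = gens_.emplace(site, GenState{std::move(values), 0}).first;
        }
        // Keep only the latest invocation of each site in the order list.
        invoked_.erase(std::remove(invoked_.begin(), invoked_.end(), site),
                       invoked_.end());
        invoked_.push_back(site);
        return it->second.values[it->second.index];
    }

    // Advance in reverse order of latest invocation. Returns false once
    // every invoked generator has wrapped around (no more runs needed).
    bool advance() {
        bool advanced = false;
        for (auto it = invoked_.rbegin(); it != invoked_.rend(); ++it) {
            GenState& g = gens_.at(*it);
            if (g.index + 1 < g.values.size()) {
                ++g.index;  // moveNext succeeded: stop here
                advanced = true;
                break;
            }
            g.index = 0;  // exhausted: reset and carry to the previous one
        }
        invoked_.clear();
        return advanced;
    }

private:
    std::map<int, GenState> gens_;
    std::vector<int> invoked_;
};

// Replays the "Depends on if" test body under this model.
std::vector<std::string> runDependsOnIf() {
    Tracker t;
    std::vector<std::string> lines;
    do {
        int choice = t.generate(0, {0, 1});
        int x = -1, y = -1;
        if (choice == 0) {
            x = t.generate(1, {2, 3});
        } else {
            y = t.generate(2, {4, 5});
        }
        lines.push_back("choice = " + std::to_string(choice) +
                        ", x = " + std::to_string(x) +
                        ", y = " + std::to_string(y));
    } while (t.advance());
    return lines;
}
```

Under this model, runDependsOnIf() produces the same four lines as the actual output shown for "Depends on if".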
This explains the previous cases perfectly. But it does not explain this (rather obscure) one:
TEST_CASE("Non structured generators") {
int x = -1, y = -1;
for (int i = 0; i <= 1; ++i) {
x = GENERATE(0, 1);
if (i == 1) break;
y = GENERATE(2, 3);
}
std::cout << "x = " << x << ", y = " << y << std::endl;
}
One would expect this to run 4 == 2 * 2 times, with the output being:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
(The x changes before y since x = GENERATE(0, 1) is the last generator invoked)
However, this is not what Catch2 actually does. This is what happens in reality:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
8 lines of output, which is the first four lines repeated twice.
So my question is, how exactly is GENERATE in Catch2 implemented? I am not looking particularly for detailed code, but a high-level description that could explain what I have seen in the previous examples.
You could try inspecting the code that the preprocessor generates, using GCC's -E option.
a.c:
GENERATE(0,1)
gcc -E -CC a.c
How to make G++ preprocessor output a newline in a macro?

Advanced rgb2hsv conversion Matlab to OpenCV/C++ access to pixel value

I am building a program in Objective-C/C++ with OpenCV. I am fairly skilled in Objective-C but new to C++.
I am building a custom RGB2HSV algorithm. It differs slightly from OpenCV's cvtColor(in, out, CV_RGB2HSV).
The version I am trying to translate from Matlab to OpenCV/C++ produces such a clear HSV image that no additional filtering is needed before further processing. The Matlab code is below and is self-explanatory.
I am trying to turn it into a C++/OpenCV function, but I hit a wall when accessing the pixel values of the image.
I have read a lot about how to access the Mat structure, but I usually get either a bunch of letters in place of zeros or something like “\202 k g” instead of a number. When I try to do arithmetic on, say, \202, the result makes no mathematical sense.
Please help me access the pixel values properly. Also, uchar won't work in the current version because some intermediate values fall outside the 0-255 range.
The algorithm is not mine. I cannot even point to its source, but it gives clearly better results than the stock RGB2HSV.
The algorithm below handles a single pixel. It needs to be applied to each pixel in the image, so the final version needs to be wrapped in nested for loops.
I would also like to share this method with the community so everyone can benefit from it and save on pre-filtering.
Please help me translate it to C++/OpenCV, ideally following best practices for speed, or at least show me how to access the pixel values so that they can be used in ordinary arithmetic. Thanks in advance.
function[H, S, V] = rgb2hsvPixel(R,G,B)
% Algorithm:
% In case of 8-bit and 16-bit images, `R`, `G`, and `B` are converted to the
% floating-point format and scaled to fit the 0 to 1 range.
%
% V = max(R,G,B)
% S = (V - min(R,G,B)) / V                  if V != 0
%     0                                     otherwise
% H = 60*(G-B) / (V - min(R,G,B))           if V = R
%     120 + 60*(B-R) / (V - min(R,G,B))     if V = G
%     240 + 60*(R-G) / (V - min(R,G,B))     if V = B
%
% If `H<0` then `H=H+360`. On output `0<=V<=1`, `0<=S<=1`, `0<=H<=360`.
red = (double(R)-16)*255/224; % \
green = (double(G)-16)*255/224; % }- R,G,B (0 <-> 255) -> (-18.2143 <-> 272.0759)
blue = (min(double(B)*2,240)-16)*255/224; % /
minV = min(red,min(green,blue));
value = max(red,max(green,blue));
delta = value - minV;
if(value~=0)
sat = (delta*255) / value;% s
if (delta ~= 0)
if( red == value )
hue = 60*( green - blue ) / delta; % between yellow & magenta
elseif( green == value )
hue = 120 + 60*( blue - red ) / delta; % between cyan & yellow
else
hue = 240 + 60*( red - green ) / delta; % between magenta & cyan
end
if( hue < 0 )
hue = hue + 360;
end
else
hue = 0;
sat = 0;
end
else
% r = g = b = 0
sat = 0;
hue = 0;
end
H = max(min(floor(((hue*255)/360)),255),0);
S = max(min(floor(sat),255),0);
V = max(min(floor(value),255),0);
end
To access the value of a pixel in a 3-channel, 8-bit precision image (type CV_8UC3) you have to do it like this:
cv::Mat image;
cv::Vec3b BGR = image.at<cv::Vec3b>(i,j);
If, as you say, 8-bit precision and range are not enough, you can declare a cv::Mat of type CV_32F to store floating point 32-bit numbers.
cv::Mat image(height, width, CV_32FC3);
//fill your image with data
for(int i = 0; i < image.rows; i++) {
for(int j = 0; j < image.cols; j++) {
cv::Vec3f BGR = image.at<cv::Vec3f>(i,j);
//process your pixel
cv::Vec3f HSV; //your calculated HSV values
image.at<cv::Vec3f>(i,j) = HSV;
}
}
Be aware that OpenCV stores rgb values in the BGR order and not RGB. Take a look at OpenCV docs to learn more about it.
If you are concerned by performance and fairly comfortable with pixel indexes, you can use directly the Mat ptr.
For example:
cv::Mat img = cv::Mat::zeros(4, 8, CV_8UC3);
uchar *ptr_row_img;
int cpt = 0;
for(int i = 0; i < img.rows; i++) {
ptr_row_img = img.ptr<uchar>(i);
for(int j = 0; j < img.cols; j++) {
for(int c = 0; c < img.channels(); c++, cpt++, ++ptr_row_img) {
*ptr_row_img = cpt;
}
}
}
std::cout << "img=\n" << img << std::endl;
The previous code should print:
img=
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23;
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47;
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71;
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
The at access should be enough in most cases, and it is much more readable and less error-prone than the ptr access.
References:
How to scan images, lookup tables and time measurement with OpenCV
C++: OpenCV: fast pixel iteration
Thanks, everybody, for the help.
Thanks to your hints, I put together the custom rgb2hsv function in C++/OpenCV.
From the top left: edges after bgr->gray->edges, bgr->HSV->edges, and bgr->customHSV->edges.
Below each of them are the filter settings needed to achieve approximately the same clear results. The bigger the filter radius, the more complex and time-consuming the computation.
The custom conversion produces clearer edges in the next steps of image processing.
It can be tweaked further by experimenting with the parameters in the r, g, b channels:
red = (red-16)*1.1384; //255/224=1.1384
Here, 16: the bigger the number, the clearer V becomes.
255/224 also affects the outcome by stretching it beyond the 0-255 range, to be clipped later.
These numbers seem to be a sweet spot, but anyone can adjust them for specific needs.
With this function, a separate BGR-to-RGB conversion can be avoided, because each color is read directly from its channel in the raw image.
It is probably a little clumsy performance-wise. In my case it serves as the first step of color balance and histogram adjustment, so speed is not that critical.
To use it on a continuous video stream, it needs speed optimization, I think by using pointers and reducing loop complexity. Optimization is not exactly my cup of tea, so if someone could help optimize it for the community, that would be great.
Here it is ready to use:
Mat bgr2hsvCustom ( Mat& image )
{
//smallParam = 16;
for(int x = 0; x < image.rows; x++)
{
for(int y = 0; y<image.cols; y++)
{
//assigning vector to individual float BGR values
float blue = image.at<cv::Vec3b>(x,y)[0];
float green = image.at<cv::Vec3b>(x,y)[1];
float red = image.at<cv::Vec3b>(x,y)[2];
float sat, hue, minValue, maxValue, delta;
float const ang0 = 0; // min and max don't accept a mix of a variable and a numeric literal
float const ang240 = 240;
float const ang255 = 255;
red = (red-16)*1.1384; //255/224
green = (green-16)*1.1384;
blue = (min(blue*2,ang240)-16)*1.1384;
minValue = min(red,min(green,blue));
maxValue = max(red,max(green,blue));
delta = maxValue - minValue;
if (maxValue != 0)
{
sat = (delta*255) / maxValue;
if ( delta != 0)
{
if (red == maxValue){
hue = 60*(green - blue)/delta;
}
else if( green == maxValue ) {
hue = 120 + 60*( blue - red )/delta;
}
else{
hue = 240 + 60*( red - green )/delta;
}
if( hue < 0 ){
hue = hue + 360;
}
}
else{
sat = 0;
hue = 0;
}
}
else{
hue = 0;
sat = 0;
}
image.at<cv::Vec3b>(x,y)[0] = max(min(floor(maxValue),ang255),ang0); //V
image.at<cv::Vec3b>(x,y)[1] = max(min(floor(sat),ang255),ang0); //S
image.at<cv::Vec3b>(x,y)[2] = max(min(floor(((hue*255)/360)),ang255),ang0); //H
}
}
return image;
}
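To check the arithmetic separately from cv::Mat access, the per-pixel steps could also be factored into a standalone function with no OpenCV dependency. The following is a sketch under that assumption; HsvPixel, clampToByte, and rgb2hsvCustomPixel are hypothetical names, and the scale factor is taken as 255/224 from the Matlab original.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical result type; channels are stored as H, S, V, each in 0..255.
struct HsvPixel {
    std::uint8_t h, s, v;
};

// Floor, then clamp to [0, 255], like max(min(floor(x),255),0) in Matlab.
inline std::uint8_t clampToByte(float x) {
    return static_cast<std::uint8_t>(
        std::max(0.0f, std::min(255.0f, std::floor(x))));
}

// Per-pixel version of the custom conversion (assumed 8-bit R, G, B input).
inline HsvPixel rgb2hsvCustomPixel(std::uint8_t r8, std::uint8_t g8,
                                   std::uint8_t b8) {
    const float scale = 255.0f / 224.0f;
    const float red   = (r8 - 16.0f) * scale;
    const float green = (g8 - 16.0f) * scale;
    // Blue is doubled and capped at 240 before scaling, as in the original.
    const float blue  = (std::min(b8 * 2.0f, 240.0f) - 16.0f) * scale;

    const float minV  = std::min(red, std::min(green, blue));
    const float maxV  = std::max(red, std::max(green, blue));
    const float delta = maxV - minV;

    float hue = 0.0f;
    float sat = 0.0f;
    if (maxV != 0.0f && delta != 0.0f) {
        sat = delta * 255.0f / maxV;
        if (red == maxV) {
            hue = 60.0f * (green - blue) / delta;
        } else if (green == maxV) {
            hue = 120.0f + 60.0f * (blue - red) / delta;
        } else {
            hue = 240.0f + 60.0f * (red - green) / delta;
        }
        if (hue < 0.0f) {
            hue += 360.0f;
        }
    }
    // Hue is rescaled from 0..360 degrees to 0..255.
    return {clampToByte(hue * 255.0f / 360.0f), clampToByte(sat),
            clampToByte(maxV)};
}
```

Inside the image loop, such a function could be called once per pixel with the red, green, and blue channel values, and its result written back into whichever channel order the caller prefers.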