I'm looking for a branchless implementation of the following:
int f(int c) {
if (c == 0) {
return 0xffffffff; // all bits set
} else {
return c;
}
}
I haven't come across any clever ways to do this. Any tricks?
As mentioned by Nick ODell, there is a good chance that a compiler will already compile this code to instructions without a branch. A formulation that makes this even more likely is x - (x == 0) or x - !x, which a compiler can typically implement without branches by using CPU-specific features. You can even replace this with a formulation based purely on bit manipulation. E.g. ((x - 1) & ~x) >> 31 (with x a 32-bit unsigned value) is 1 only if x == 0, and 0 otherwise. So
x - (((x - 1) & ~x) >> 31)
would be a completely branchless implementation of f. In practice, though, I would expect it to be slower than whatever the compiler generates for the other formulations.
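For what it's worth, here is a minimal sketch of the two formulations above as functions. It assumes a 32-bit unsigned argument (uint32_t rather than the int in the question), and the names f_cmp and f_bits are made up for illustration only.

```cpp
#include <cstdint>

// Comparison-based form: (x == 0) evaluates to 1 or 0, so 0 wraps to 0xFFFFFFFF
// and every other value is returned unchanged.
inline std::uint32_t f_cmp(std::uint32_t x) {
    return x - (x == 0);
}

// Pure bit manipulation: the top bit of (x - 1) & ~x is set only when x == 0.
inline std::uint32_t f_bits(std::uint32_t x) {
    return x - (((x - 1) & ~x) >> 31);
}
```

Whether either one actually beats the plain if/else is something only the generated assembly or a benchmark can tell you.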
Related
I've spent too many brain cycles on this over the last day.
I'm trying to come up with a set of bitwise operations that may re-implement the following condition:
uint8_t a, b;
uint8_t c, d;
uint8_t e, f;
...
bool result = (a == 0xff || a == b) && (c == 0xff || c == d) && (e == 0xff || e == f);
Code I'm looking at has four of these expressions, short-circuit &&ed together (as above).
I know this is an esoteric question, but the short-circuit nature of this code and its timing in a tight loop make the lack of predictable timing a royal pain, and quite frankly, it seems to really suck on architectures where branch prediction isn't available or isn't so well implemented.
Is there such a beast that would be concise?
So, if you really want to do bit-twiddling to make this "fast" (which you really should only do after profiling your code to make sure this is a bottleneck), what you want to do is vectorize this by packing all the values together into a wider word so you can do all the comparisons at once (one instruction), and then extract the answer from a few bits.
There are a few tricks to this. To compare two values for equality, you can xor (^) them and test to see if the result is zero. To test a field of a wider word to see if it is zero, you can 'pack' it with a 1 bit above, then subtract one and see if the extra bit you added is still 1 -- if it is now 0, the value of the field was zero.
Putting all this together, you want to do six 8-bit compares at once. You can pack these values into 9-bit fields in a 64-bit word (9 bits to get that extra guard bit you're going to test after the subtraction). You can fit up to seven such 9-bit fields in a 64-bit int, so no problem:
// pack 6 9-bit values into a word
#define VEC6x9(A,B,C,D,E,F) (((uint64_t)(A) << 45) | ((uint64_t)(B) << 36) | ((uint64_t)(C) << 27) | ((uint64_t)(D) << 18) | ((uint64_t)(E) << 9) | (uint64_t)(F))
// the two values to compare
uint64_t v1 = VEC6x9(a, a, c, c, e, e);
uint64_t v2 = VEC6x9(b, 0xff, d, 0xff, f, 0xff);
uint64_t guard_bits = VEC6x9(0x100, 0x100, 0x100, 0x100, 0x100, 0x100);
uint64_t ones = VEC6x9(1, 1, 1, 1, 1, 1);
uint64_t alt_guard_bits = VEC6x9(0, 0x100, 0, 0x100, 0, 0x100);
// do the comparisons in parallel
uint64_t res_vec = ((v1 ^ v2) | guard_bits) - ones;
// mask off the bits we'll ignore (optional for clarity, not needed for correctness)
res_vec &= guard_bits;
// do the 3 OR ops in parallel
res_vec &= res_vec >> 9;
// get the result
bool result = (res_vec & alt_guard_bits) == 0;
The ORs and ANDs at the end are 'backwards' because the result bit for each comparison is 0 if the comparison was true (values were equal) and 1 if it was false (values were not equal).
All of the above is mostly of interest if you are writing a compiler -- it's how you end up implementing a vector comparison -- and it may well be the case that a vectorizing compiler will do it all for you automatically.
This can be much more efficient if you can arrange to have your initial values pre-packed into vectors. This may in turn influence your choice of data structures and allowable values -- if you arrange for your values to be 7-bit or 15-bit (instead of 8-bit) they may pack nicer when you add the guard bits...
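As a rough sketch, the guard-bit scheme described above can be wrapped into a function and checked against the straightforward expression. The names vec6x9, match_packed and match_reference are hypothetical, and the intermediate result is masked down to the guard bits before the fields are combined.

```cpp
#include <cstdint>

// Pack six 9-bit fields (8 data bits plus one guard bit each) into a 64-bit word.
constexpr std::uint64_t vec6x9(std::uint64_t a, std::uint64_t b, std::uint64_t c,
                               std::uint64_t d, std::uint64_t e, std::uint64_t f) {
    return (a << 45) | (b << 36) | (c << 27) | (d << 18) | (e << 9) | f;
}

// Branch-free evaluation of
// (a == 0xff || a == b) && (c == 0xff || c == d) && (e == 0xff || e == f).
inline bool match_packed(std::uint8_t a, std::uint8_t b, std::uint8_t c,
                         std::uint8_t d, std::uint8_t e, std::uint8_t f) {
    const std::uint64_t v1 = vec6x9(a, a, c, c, e, e);
    const std::uint64_t v2 = vec6x9(b, 0xff, d, 0xff, f, 0xff);
    const std::uint64_t guard = vec6x9(0x100, 0x100, 0x100, 0x100, 0x100, 0x100);
    const std::uint64_t ones = vec6x9(1, 1, 1, 1, 1, 1);
    const std::uint64_t alt_guard = vec6x9(0, 0x100, 0, 0x100, 0, 0x100);

    // A guard bit survives the subtraction only if its field was nonzero,
    // i.e. only if the two bytes compared in that field differed.
    std::uint64_t res = (((v1 ^ v2) | guard) - ones) & guard;
    // Combine neighbouring fields: a pair fails only if both of its compares failed.
    res &= res >> 9;
    return (res & alt_guard) == 0;
}

// The original short-circuit expression, kept for comparison.
inline bool match_reference(std::uint8_t a, std::uint8_t b, std::uint8_t c,
                            std::uint8_t d, std::uint8_t e, std::uint8_t f) {
    return (a == 0xff || a == b) && (c == 0xff || c == d) && (e == 0xff || e == f);
}
```

Packing on every call costs several shifts and ORs, so this form mainly pays off when the values are already stored packed.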
You could modify how you store and interpret the data:
When a is 0xFF, do you need the value of b? If not, then make b equal to 0xFF and simplify the expression by removing the part that tests for 0xFF.
Also, you might combine a, c and e in a single variable, and b, d and f in another.
uint32_t ace;
uint32_t bdf;
bool result = ace == bdf;
Other operations might be slower but that loop should be much faster (single comparison instead of up to 6 comparisons).
You might want to use a union to be able to access the bytes individually or as a group. In that case, make sure that the fourth byte is always 0.
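A small sketch of that idea, using shifts instead of a union. The helper names pack3, stored_value and matches are invented for this example, and it assumes you are free to rewrite the value bytes once, when the data is stored.

```cpp
#include <cstdint>

// Pack three bytes into the low 24 bits; the fourth (top) byte stays 0.
inline std::uint32_t pack3(std::uint8_t x, std::uint8_t y, std::uint8_t z) {
    return static_cast<std::uint32_t>(x)
         | (static_cast<std::uint32_t>(y) << 8)
         | (static_cast<std::uint32_t>(z) << 16);
}

// Done once when the data is stored, not in the hot loop: where the pattern
// byte is the 0xFF wildcard, store 0xFF for the value byte as well.
inline std::uint8_t stored_value(std::uint8_t pattern, std::uint8_t value) {
    return pattern == 0xFF ? std::uint8_t{0xFF} : value;
}

// The test in the hot loop is then a single comparison.
inline bool matches(std::uint32_t patterns, std::uint32_t values) {
    return patterns == values;
}
```

The conditional in stored_value still exists, but it runs when the data is written rather than on every pass through the loop.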
To remove the timing variations of && and ||, use & and | (@molbdnilo). Possibly faster, maybe not. Certainly easier to parallelize.
// bool result = (a == 0xff || a == b) && (c == 0xff || c == d)
// && (e == 0xff || e == f);
bool result = ((a == 0xff) | (a == b)) & ((c == 0xff) | (c == d))
& ((e == 0xff) | (e == f));
I like finding the shortest ways to code things. I need a method for calculating the sum of the digits (that is, the number of 1s) of a number represented in binary. I have used bit operators and found this:
r=1;while(a&=a-1)r++;
where a is the number, and r is the count. a is a given integer. Is there any way to shorten this/improve the algorithm?
Shortest as in shortest length of source code.
Your solution assumes a to have an unsigned type.
Yet the code does not work for a = 0. You can fix it this way:
r=!!a;while(a&=a-1)r++;
You can shave one character off this way:
for(r=!!a;a&=a-1;r++);
But here is an alternative solution with the same source length:
for(r=0;a;a/=2)r+=a&1;
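For readers who want to see what the golfed loops are doing, here is an ungolfed sketch of both. The function names are made up, and the argument is assumed to be unsigned, as noted above.

```cpp
// Clears the lowest set bit on each pass, so the loop body runs once per set bit.
inline unsigned count_by_clearing(unsigned a) {
    unsigned r = 0;
    while (a != 0) {
        a &= a - 1;
        ++r;
    }
    return r;
}

// Tests the lowest bit and shifts right, so the loop body runs once per bit
// position up to the highest set bit.
inline unsigned count_by_shifting(unsigned a) {
    unsigned r = 0;
    for (; a != 0; a /= 2) {
        r += a & 1u;
    }
    return r;
}
```

Starting r at 0 and testing a before clearing a bit handles a = 0 without the !!a trick.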
As Lundin mentioned, code golfing is off topic on Stack Overflow. It is a fun game, and one can definitely hone his C skills at trying to make the smallest code that is still fully defined for a given problem, but the resulting code is of poor value to casual readers trying to program at a more basic level.
Regarding the on topic part of your question, The quickest method to compute the number of bits in an integer: this problem has been studied intensively and several methods are available. Which one is fastest depends on many factors:
how portable the code needs to be. Some processors have built-in instructions for this and the compiler may provide a way to generate them via intrinsics or inline assembly.
the expected range of values for the argument. If the range is small, a simple lookup table may yield the best performance.
the distribution of values of the argument: if a specific value is almost always given, just testing for it might be the fastest solution.
the CPU-specific performance: different algorithms use different instructions, and their relative performance may differ from one CPU to another.
Only careful benchmarking will tell you if a given approach is preferable to another, or if you are trying to optimise code whose performance is irrelevant. Provable correctness is much more important than micro-optimisation. Many experts consider optimisation to always be premature.
An interesting solution for 32-bit integers is this:
uint32_t bitcount_parallel(uint32_t v) {
uint32_t c = v - ((v >> 1) & 0x55555555);
c = ((c >> 2) & 0x33333333) + (c & 0x33333333);
c = ((c >> 4) + c) & 0x0F0F0F0F;
c = ((c >> 8) + c) & 0x00FF00FF;
return ((c >> 16) + c) & 0x0000FFFF;
}
If multiplication is fast, here is a potentially faster solution:
uint32_t bitcount_hybrid(uint32_t v) {
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
return (((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
}
Different solutions are detailed here: https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
The fastest possible code is to generate a look-up table, with the value of the variable as index. Example for uint8_t:
const uint8_t NUMBER_OF_ONES [256] =
{
0, // 0
1, // 1
1, // 2
2, // 3
1, // 4
2, // 5
...
8, // 255
};
You would use it as n = NUMBER_OF_ONES[a];.
The second fastest is to generate smaller look-up tables, to save ROM. For example a nibble-wise look-up for numbers 0 to 15, which you would then call for every nibble in the data type.
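A minimal sketch of the nibble-wise variant, assuming 8-bit and 32-bit inputs. The table name kNibbleOnes and the function names are made up for this example.

```cpp
#include <cstdint>

// Number of set bits in each 4-bit value 0..15.
static const std::uint8_t kNibbleOnes[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

// Two look-ups for a byte: low nibble plus high nibble.
inline unsigned count_ones8(std::uint8_t a) {
    return kNibbleOnes[a & 0x0F] + kNibbleOnes[a >> 4];
}

// Eight look-ups for a 32-bit value, one per nibble.
inline unsigned count_ones32(std::uint32_t a) {
    unsigned n = 0;
    for (int i = 0; i < 8; ++i) {
        n += kNibbleOnes[a & 0x0F];
        a >>= 4;
    }
    return n;
}
```

This trades the 256-byte table for a 16-byte one at the cost of more look-ups per value.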
Note that the requirement "Shortest as in shortest length of source code." is nonsense, that's not a metric used by professionals. If that's truly what you are after, for the sake of fun or obfuscation, then the question is off-topic on SO and should be asked at https://codegolf.stackexchange.com instead.
For my implementation of the minhashing algorithm I need to make many random permutations of integers, which will be simulated by using random hash functions (as many as possible). Currently I use hash functions of the form:
h(x) = (a*x + b) % c
where a and b are randomly generated numbers, and c is a prime number bigger than the highest value of b. Anyways, the code runs way too slow and it is impossible to use more than 15 of such hash functions in reasonable running time. Can anyone recommend other ways of using random hash functions for integers in Python? In other posts I came across suggestions for using bitwise shuffling and an XOR operation, but I didn't fully understand how one should implement something like this (I'm relatively new to Python).
Borrowing from my answer to a similar question, and having a quick look at Python documentation to try to guess valid syntax...
The code you posted is OK but it's probably subject to being computed in longer precision than is optimal, and it involves a division which also makes things slow.
To make it faster, you can fix c at a power of two, and you can use binary & (and) instead of modulo, which gives you this:
h(x) = (a * x + b) & ((1 << 32) - 1)
which is the same as:
h(x) = (a * x + b) & (4294967296 - 1)
which is the same as:
h(x) = (a * x + b) % 4294967296
and you must ensure that a is an odd number (this is all that's needed to make it co-prime with c when c is a power of two). This example limits the output range to a 32-bit integer. You can change that as you see fit. I don't know what Python's limits are.
If you want more parameterisation than that, or you discover that the results aren't "random" enough (it would fail statistical tests very quickly, but that usually doesn't matter), then you can add more operations. Adding more operations of the same kind won't help, though, because a chain of adds and multiplies always simplifies to a single add-and-multiply pair, so the extra operations wouldn't fix anything.
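To spell out why chaining the same kind of step gains nothing, here is the algebra for two such steps. The parameters (a_1, b_1) and (a_2, b_2) are hypothetical, and everything is taken modulo 2^32.

```latex
\begin{aligned}
h_1(x) &\equiv a_1 x + b_1 \pmod{2^{32}} \\
h_2(h_1(x)) &\equiv a_2 (a_1 x + b_1) + b_2 \pmod{2^{32}} \\
            &\equiv (a_1 a_2)\, x + (a_2 b_1 + b_2) \pmod{2^{32}}
\end{aligned}
```

So the composition is again of the form a * x + b, with a = a_1 * a_2 and b = a_2 * b_1 + b_2.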
What you can do instead is to use bit shifts and exclusive-or to break up the linearity; like so:
def h(x):
x = x ^ (x >> 16)
x = (a * x + b) & ((1 << 32) - 1)
x = x ^ (x >> 16)
x = (c * x + d) & ((1 << 32) - 1)
x = x ^ (x >> 16)
return x
You can experiment with variations on that if you want. If you set b and d to zero and change the middle 16 to 13 then you get the MurmurHash3 finaliser construction, which is near enough to ideal for most purposes provided you pick good a and c (sadly they can't just be random).
I come here to ask for tricks. I've got a 32-bit integer (that's 4 bytes). I want to test zero for each byte, and return true if one of them is true.
E.g.
int c1 = 0x01020304;
cout<<test(c1)<<endl; // output false
int c2 = 0x00010203;
cout<<test(c2)<<endl; // output true
int c3 = 0xfffefc00;
cout<<test(c3)<<endl; // output true
Are there any tricks to do it in the least number of CPU cycles?
There are several ways on the famous Bit Twiddling Hacks page:
bool hasZeroByte(unsigned int v)
{
return ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}
or
bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
if (hasZeroByte) // or may just have 0x80 in the high byte
{
hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}
And the likely most compact way when compiling to assembly
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
As they're tricks, they're hard to understand, so if you want clarity, mask out each byte and check it, as in dasblinkenlight's answer.
Example assembly output on Compiler Explorer
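If you want to convince yourself that the compact form is right, here is a small sketch that wraps it in a function next to a byte-by-byte reference. It assumes a 32-bit unsigned input, and the function names are invented for this example.

```cpp
#include <cstdint>

// Same expression as the haszero macro. Subtracting 1 from every byte sets
// bit 7 of a byte that was 0x00, and "& ~v" discards bytes whose bit 7 was
// already set. The result is nonzero exactly when some byte of v is zero.
inline bool has_zero_byte(std::uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

// Straightforward reference: check each of the four bytes in turn.
inline bool has_zero_byte_reference(std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        if (((v >> shift) & 0xFFu) == 0) {
            return true;
        }
    }
    return false;
}
```

The three values from the question (0x01020304, 0x00010203, 0xfffefc00) give false, true and true with both functions.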
You can test it by masking each of the bytes in an & operation, and comparing the result to zero:
bool hasZeroByte(int32_t n) {
return !(n & 0x000000FF)
|| !(n & 0x0000FF00)
|| !(n & 0x00FF0000)
|| !(n & 0xFF000000);
}
The fastest way to do this is probably to use strnlen, since most compilers will have optimized this to use low level instructions for finding zero bytes in strings.
bool hasZeroByte(int32_t n) {
return strnlen(reinterpret_cast<char *>(&n), 4) < 4;
}
If you want to be a little more explicit, you could use the memchr function which is documented to do exactly what you are asking:
bool hasZeroByte(int32_t n) {
return memchr(reinterpret_cast<void *>(&n), 0, 4) != nullptr;
}
For those who don't believe this answer, feel free to take a look at the glibc implementation of strlen and see that it is already doing all of the mentioned bit twiddling tricks in the other answers.
See also:
http://www.strchr.com/optimized_strlen_function
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
http://www.int80h.org/strlen/
I am using an unsigned char to store 8 flags. Each flag represents the corner of a cube. So 00000001 will be corner 1, 01000100 will be corners 3 and 7, etc. My current solution is to & the result with 1, 2, 4, 8, 16, 32, 64 and 128, check whether the result is not zero and store the corner. That is, if (result & 1) corners.push_back(1);. Any chance I can get rid of that 'if' statement? I was hoping I could get rid of it with bitwise operators but I could not think of any.
A little background on why I want to get rid of the if statement. This cube is actually a Voxel which is part of a grid that is at least 512x512x512 in size. That is more than 134 million Voxels. I am performing calculations on each one of the Voxels (well, not exactly, but I won't go into too much detail as it is irrelevant here) and that is a lot of calculations. And I need to perform these calculations per frame. Any speed boost that is minuscule per function call will help with this number of calculations. To give you an idea, my algorithm (at some point) needed to determine whether a float was negative, positive or zero (within some error). I had if statements in there and greater/smaller than checks. I replaced that with a fast float to int function and shaved off a quarter of a second. Currently, each frame in a 128x128x128 grid takes a little more than 4 seconds.
I would consider a different approach to it entirely: there are only 256 possibilities for different combinations of flags. Precalculate 256 vectors and index into them as needed.
std::vector<std::vector<int> > corners(256);
for (int i = 0; i < 256; ++i) {
std::vector<int>& v = corners[i];
if (i & 1) v.push_back(1);
if (i & 2) v.push_back(2);
if (i & 4) v.push_back(3);
if (i & 8) v.push_back(4);
if (i & 16) v.push_back(5);
if (i & 32) v.push_back(6);
if (i & 64) v.push_back(7);
if (i & 128) v.push_back(8);
}
for (int i = 0; i < NumVoxels(); ++i) {
unsigned char flags = GetFlags(i);
const std::vector<int>& v = corners[flags];
... // do whatever with v
}
This would avoid all the conditionals and having push_back call new which I suspect would be more expensive anyway.
If there's some operation that needs to be done if the bit is set and not if it's not, it seems you'll have to have a conditional of some kind somewhere. If it could be expressed as a calculation somehow, you could get around it like this, for example:
numCorners = ((result >> 0) & 1) + ((result >> 1) & 1) + ((result >> 2) & 1) + ...
Hacker's Delight, first page:
x & (-x) // isolates the lowest set bit
x & (x - 1) // clears the lowest set bit
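As a sketch of how these two identities could be used here: the loop below visits only the set bits of the flag byte and turns each one into a corner number from 1 to 8. The helper names corner_of and corners_of are made up, and the corner numbering (bit 0 is corner 1) is taken from the question.

```cpp
#include <vector>

// Corner number (1..8) for a mask with exactly one bit set in the range 1..128.
// Each comparison yields 0 or 1 and contributes one binary digit of the bit index.
inline int corner_of(unsigned mask) {
    return 1 + ((mask & 0xF0u) != 0) * 4
             + ((mask & 0xCCu) != 0) * 2
             + ((mask & 0xAAu) != 0) * 1;
}

// Collect the corner numbers of all set flags, lowest corner first.
inline std::vector<int> corners_of(unsigned char flags) {
    std::vector<int> corners;
    unsigned bits = flags;
    while (bits != 0) {
        unsigned lowest = bits & (0u - bits);  // isolates the lowest set bit
        corners.push_back(corner_of(lowest));
        bits &= bits - 1;                      // clears the lowest set bit
    }
    return corners;
}
```

The loop condition is still a branch, but it runs once per set flag instead of testing all eight bits.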
Inlining your push_back method would also help (better create a function that receives all the flags together).
Usually if you need performance, you should design the whole system with that in mind. Maybe if you post more code it will be easier to help.
EDIT: here is a nice idea:
unsigned char LOG2_LUT[256] = {...};
int t;
switch (count_set_bits(flags)){
case 8: t = flags;
flags &= (flags - 1); // clearing a bit that was set
t ^= flags; // getting the changed bit
corners.push_back(LOG2_LUT[t]);
case 7: t = flags;
flags &= (flags - 1);
t ^= flags;
corners.push_back(LOG2_LUT[t]);
case 6: t = flags;
flags &= (flags - 1);
t ^= flags;
corners.push_back(LOG2_LUT[t]);
// etc...
};
count_set_bits() is a well-known function: http://www-graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
There is a way, it's not "pretty", but it works.
(result & 1) && (corners.push_back(1), true);
(result & 2) && (corners.push_back(2), true);
(result & 4) && (corners.push_back(3), true);
(result & 8) && (corners.push_back(4), true);
(result & 16) && (corners.push_back(5), true);
(result & 32) && (corners.push_back(6), true);
(result & 64) && (corners.push_back(7), true);
(result & 128) && (corners.push_back(8), true);
It uses a little-known feature of the C++ language: short-circuit evaluation of &&. The comma expression is needed because push_back returns void, which cannot be an operand of &&.
I've noted a similar algorithm in the OpenTTD code. It turned out to be utterly useless: you're better off not breaking down numbers like that. Instead, replace the iteration over the vector<> you have now by an iteration over the bits of the byte. This is far more cache-friendly.
I.e.
unsigned char flags = Foo(); // the value you didn't put in a vector<>
for (unsigned char c = (UCHAR_MAX >> 1) + 1; c != 0; c >>= 1)
{
if (flags & c)
Bar(flags&c);
}