Why do x+0 and x|0 have different results?
Below is my code.
My environment is WSL (Debian Sid) + GCC 10.2.1.
#include <stdio.h>
/**
* Do rotating left shift. Assume 0 <= n < w
* Examples when x = 0x12345678 and w = 32:
* n = 4 -> 0x23456781, n = 20 -> 0x67812345
*/
unsigned rotate_left(const unsigned x, const int n)
{
const int w = sizeof(unsigned) << 3;
return (x << n) + (x >> (w - n));
}
int main()
{
int x = 0x12345678;
printf("n = 4, %#x -> %#x\n", x, rotate_left(x, 4));
printf("n = 20, %#x -> %#x\n", x, rotate_left(x, 20));
printf("n = 0, %#x -> %#x\n", x, rotate_left(x, 0));
}
When n = 0, the result is 0x2468acf0.
When I replace return (x << n) + (x >> (w - n)) with return (x << n) | (x >> (w - n)), I get 0x12345678.
If x is an integer type with at most 32 bits, then x>>32 is Undefined Behaviour, which means that the result can be absolutely anything (and it can be a different anything in different programs). (Also true of x<<32.) [Note 1]
From §6.5.7 para 3 of the C standard, concerning << and >> operators, emphasis added:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
As a result, your rotate function won't work if n <= 0 or n >= w, and you should test for those cases rather than assume they won't occur (since they obviously will).
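One common way to make the rotate total is to reduce n modulo the width and mask the other shift count as well, so neither shift amount ever reaches w. A sketch (assuming, as is true here, that w is a power of two):
unsigned rotate_left_safe(const unsigned x, int n)
{
    const unsigned w = sizeof(unsigned) * 8;        /* bit width, assumed to be a power of two */
    n &= (int)(w - 1);                              /* reduce n into [0, w) */
    return (x << n) | (x >> ((w - n) & (w - 1)));   /* n == 0 now shifts right by 0, not by w */
}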
Notes
In practice, on Intel hardware (and probably others), a shift by the width of the operand is a no-op, not "clear to 0". So x>>32 is x, as is x<<0, and thus the sum is twice x, while the bitwise or is exactly x. But you cannot rely on that fact, because it's Undefined Behaviour, and compiler optimizations might result in other arbitrary results.
This function is used to calculate the XOR of two 32-bit integers:
#include <stdbool.h> // for bool in C

int xor32int(int x, int y)
{
int res = 0; // Initialize result
// Assuming 32-bit Integer
for (int i = 31; i >= 0; i--)
{
// Find current bits in x and y
bool b1 = x & (1 << i);
bool b2 = y & (1 << i);
// If both are 1 then 0 else xor is same as OR
bool xoredBit = (b1 & b2) ? 0 : (b1 | b2);
// Update result
res <<= 1;
res |= xoredBit;
}
return res;
}
This works fine when XOR'ing 8 bit values, but they first need to be cast to int, i.e.
char byte1 = 0x23, byte2 = 0x34;
int result = xor32int((int)byte1, (int)byte2);
And since xor32int() assumes its inputs are 32-bit integers, it always runs the loop 32 times, so even when the values are only 8 bits it performs unnecessary extra iterations, resulting in a major decrease in performance.
How would I go about converting the xor32int() function so it only works with 8 bit values so it doesn't need to loop 32 times?
If you are wondering why don't I simply use the XOR operator, it is because I am working with an old machine that uses a processor that doesn't support XOR.
Is there a reason you can't use (x | y) & ~(x & y)? That's one definition of xor. You can write it as a function:
char xor8(char x, char y) {
return (x | y) & ~(x & y);
}
You can even write it as a function template:
template<typename T>
T xorT(T x, T y) {
return (x | y) & ~(x & y);
}
In case you can't use that for some reason, I'm pretty sure you can replace int with char, and 31 with 7:
char xor8char(char x, char y)
{
char res = 0;
for (int i = 7; i >= 0; i--)
{
bool b1 = x & (1 << i);
bool b2 = y & (1 << i);
bool xoredBit = (b1 & b2) ? 0 : (b1 | b2);
res <<= 1;
res |= xoredBit;
}
return res;
}
All of that, live on Coliru.
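As a quick sanity check, here is a minimal usage example (assuming the xor8() above is in scope); 0x23 ^ 0x34 is 0x17:
#include <stdio.h>

int main(void) {
    char byte1 = 0x23, byte2 = 0x34;
    printf("%#x\n", (unsigned)(unsigned char)xor8(byte1, byte2));   /* prints 0x17 */
    return 0;
}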
I want a function
int rounded_division(const int a, const int b) {
return round(1.0 * a/b);
}
So we have, for example,
rounded_division(3, 2) // = 2
rounded_division(2, 2) // = 1
rounded_division(1, 2) // = 1
rounded_division(0, 2) // = 0
rounded_division(-1, 2) // = -1
rounded_division(-2, 2) // = -1
rounded_division(-3, -2) // = 2
Or in code, where a and b are 32 bit signed integers:
int rounded_division(const int a, const int b) {
return ((a < 0) ^ (b < 0)) ? ((a - b / 2) / b) : ((a + b / 2) / b);
}
And here comes the tricky part: How to implement this guy efficiently (not using larger 64 bit values) and without logical operators such as ?:, &&, ...? Is it possible at all?
The reason I want to avoid logical operators is that the processor I have to implement this function for has no conditional instructions (more about missing conditional instructions on ARM).
a/b + a%b/(b/2 + b%2) works quite well - it has not failed in over a billion test cases. It meets all of OP's goals: no overflow, no long long, no branching, and it works over the entire range of int whenever a/b is defined.
No 32-bit dependency. If using C99 or later, no implementation behavior restrictions.
int rounded_division(int a, int b) {
int q = a / b;
int r = a % b;
return q + r/(b/2 + b%2);
}
This works with 2's complement, 1s' complement and sign-magnitude as all operations are math ones.
How about this:
int rounded_division(const int a, const int b) {
return (a + b/2 + b * ((a^b) >> 31))/b;
}
(a ^ b) >> 31 should evaluate to -1 if a and b have different signs and 0 otherwise, assuming int has 32 bits and the leftmost is the sign bit.
EDIT
As pointed out by @chux in his comments, this method is wrong due to integer division. This new version evaluates the same as OP's example, but uses a few more operations.
int rounded_division(const int a, const int b) {
return (a + b * (1 + 2 * ((a^b) >> 31)) / 2)/b;
}
This version still however does not take into account the overflow problem.
What about
...
return ((a + (a*b)/abs(a*b) * b / 2) / b);
}
Without overflow:
...
return ((a + ((a/abs(a))*(b/abs(b))) * b / 2) / b);
}
This is a rough approach that you may use: a mask derived from the sign of a*b subtracts one from the result when a*b < 0.
Please note that I did not test this appropriately.
int function(int a, int b){
int tmp = float(a)/b + 0.5;
int mask = (a*b) >> 31; // shift sign bit to set rest of the bits
return tmp - (1 & mask);//minus one if a*b was < 0
}
The following rounded_division_test1() meets OP's requirement of no branching - if one counts sign(int a), nabs(int a), and cmp_le(int a, int b) as non-branching. See here for ideas of how to do sign() without compare operators. These helper functions could be rolled into rounded_division_test1() without explicit calls.
The code demonstrates the correct functionality and is useful for testing various answers. When a/b is defined, this answer does not overflow.
#include <limits.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
int nabs(int a) {
return (a < 0) * a - (a >= 0) * a;
}
int sign(int a) {
return (a > 0) - (a < 0);
}
int cmp_le(int a, int b) {
return (a <= b);
}
int rounded_division_test1(int a, int b) {
int q = a / b;
int r = a % b;
int flag = cmp_le(nabs(r), (nabs(b) / 2 + nabs(b % 2)));
return q + flag * sign(b) * sign(r);
}
// Alternative that uses long long
int rounded_division_test1LL(int a, int b) {
int c = (a^b)>>31;
return (a + (c*2 + 1)*1LL*b/2)/b;
}
// Reference code
int rounded_division(int a, int b) {
return round(1.0*a/b);
}
int test(int a, int b) {
int q0 = rounded_division(a, b);
//int q1 = function(a,b);
int q1 = rounded_division_test1(a, b);
if (q0 != q1) {
printf("%d %d --> %d %d\n", a, b, q0, q1);
fflush(stdout);
}
return q0 != q1;
}
void tests(void) {
int err = 0;
int const a[] = { INT_MIN, INT_MIN + 1, INT_MIN + 1, -3, -2, -1, 0, 1, 2, 3,
INT_MAX - 1, INT_MAX };
for (unsigned i = 0; i < sizeof a / sizeof a[0]; i++) {
for (unsigned j = 0; j < sizeof a / sizeof a[0]; j++) {
if (a[j] == 0) continue;
if (a[i] == INT_MIN && a[j] == -1) continue;
err += test(a[i], a[j]);
}
}
printf("Err %d\n", err);
}
int main(void) {
tests();
return 0;
}
Let me give my contribution:
What about:
int rounded_division(const int a, const int b) {
return a/b + (2*(a%b))/b;
}
No branch, no logical operators, only mathematical operators. But it could fail if b is greater than INT_MAX/2 or less than INT_MIN/2.
But if 64-bit arithmetic is allowed for computing the 32-bit result, it will not fail:
int rounded_division(const int a, const int b) {
return a/b + (2LL*(a%b))/b;
}
Code that I came up with for use on ARM M0 (no floating point, slow divide).
It only uses one divide instruction and no conditionals, but will overflow if numerator + (denominator/2) > INT_MAX.
Cycle count on ARM M0 = 7 cycles + the divide (M0 has no divide instruction, so it is toolchain dependent).
int32_t Int32_SignOf(int32_t val)
{
return (+1 | (val >> 31)); // if v < 0 then -1, else +1
}
uint32_t Int32_Abs(int32_t val)
{
int32_t tmp = val ^ (val >> 31);
return (tmp - (val >> 31));
// the following code looks like it should be faster, using subexpression elimination
// except on arm a bitshift is free when performed with another operation,
// so it would actually end up being slower
// tmp = val >> 31;
// dst = val ^ (tmp);
// dst -= tmp;
// return dst;
}
int32_t Int32_DivRound(int32_t numerator, int32_t denominator)
{
// use the absolute (unsigned) denominator in the fudge value
// as the divide by 2 then becomes a bitshift
int32_t sign_num = Int32_SignOf(numerator);
uint32_t abs_denom = Int32_Abs(denominator);
return (numerator + sign_num * ((int32_t)(abs_denom / 2u))) / denominator;
}
Since the function seems to be symmetric, how about sign(a/b)*floor(abs(a/b)+0.5)?
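Written out in code, purely as a sketch (the name is arbitrary, it uses floating point, and the sign test here would itself need one of the branch-free sign() tricks from the other answers on the target):
#include <math.h>

int rounded_division_sym(int a, int b) {
    double q = (double)a / b;                 /* quotient as a double */
    double s = (q > 0) - (q < 0);             /* sign of the quotient */
    return (int)(s * floor(fabs(q) + 0.5));   /* round half away from zero */
}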
I am working through a problem which I was able to solve, all but for the last piece - I am not sure how one can do multiplication using bitwise operators:
0*8 = 0
1*8 = 8
2*8 = 16
3*8 = 24
4*8 = 32
Can you please recommend an approach to solve this?
To multiply by any value of 2 to the power of N (i.e. 2^N) shift the bits N times to the left.
0000 0001 = 1
times 4 = (2^2 => N = 2) = 2 bit shift : 0000 0100 = 4
times 8 = (2^3 -> N = 3) = 3 bit shift : 0010 0000 = 32
etc..
To divide shift the bits to the right.
The bits are whole 1 or 0 - you can't shift by a part of a bit. So if the number you're multiplying by does not factor into a single whole power of 2, you break it into powers of 2 and add the partial products,
ie.
since: 17 = 16 + 1
thus: 17 = 2^4 + 1
therefore: x * 17 = (x * 16) + x in other words 17 x's
thus to multiply by 17 you have to do a 4 bit shift to the left, and then add the original number again:
==> x * 17 = (x * 16) + x
==> x * 17 = (x * 2^4) + x
==> x * 17 = (x shifted to left by 4 bits) + x
so let x = 3 = 0000 0011
times 16 = (2^4 => N = 4) = 4 bit shift : 0011 0000 = 48
plus the x (0000 0011)
ie.
0011 0000 (48)
+ 0000 0011 (3)
=============
0011 0011 (51)
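The same identity in C, for concreteness (the shift does the multiply by 16, then one more x is added):
unsigned times17(unsigned x)
{
    return (x << 4) + x;   /* x*16 + x == x*17 */
}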
Edit: Update to the original answer. Charles Petzold has written a fantastic book 'Code' that will explain all of this and more in the easiest of ways. I thoroughly recommend this.
To multiply two binary encoded numbers without a multiply instruction, it would be simple to iteratively add to reach the product.
unsigned int mult(unsigned int x, unsigned int y)
{
unsigned int reg = 0;
while(y--)
reg += x;
return reg;
}
Using bit operations, the characteristics of the binary encoding can be exploited.
As explained previously, a bit shift is the same as a multiply by two.
Using this, the multiplicand can be added in at every power of two where the multiplier has a set bit.
// multiply two numbers with bit operations
unsigned int mult(unsigned int x, unsigned int y)
{
unsigned int reg = 0;
while (y != 0)
{
if (y & 1)
{
reg += x;
}
x <<= 1;
y >>= 1;
}
return reg;
}
You'd factor the multiplicand into powers of 2.
3*17 = 3*(16+1) = 3*16 + 3*1
... = 0011b << 4 + 0011b
public static int multi(int x, int y){
boolean neg = false;
if(x < 0 && y >= 0){
x = -x;
neg = true;
}
else if(y < 0 && x >= 0){
y = -y;
neg = true;
}else if( x < 0 && y < 0){
x = -x;
y = -y;
}
int res = 0;
while(y!=0){
if((y & 1) == 1) res += x;
x <<= 1;
y >>= 1;
}
return neg ? (-res) : res;
}
I believe this should be a left shift. 8 is 2^3, so left shift 3 bits:
2 << 3 = 16 // i.e. 2 * 8
-(int)multiplyNumber:(int)num1 withNumber:(int)num2
{
int mulResult =0;
int ithBit;
BOOL isNegativeSign = (num1<0 && num2>0) || (num1>0 && num2<0) ;
num1 = abs(num1);
num2 = abs(num2);
for(int i=0;i<sizeof(num2)*8;i++)
{
ithBit = num2 & (1<<i);
if(ithBit>0){
mulResult +=(num1<<i);
}
}
if (isNegativeSign) {
mulResult = ((~mulResult)+1 );
}
return mulResult;
}
I have just realized that this is the same answer as the previous one. LOL sorry.
public static uint Multiply(uint a, uint b)
{
uint c = 0;
while(b > 0)
{
c += ((b & 1) > 0) ? a : 0;
a <<= 1;
b >>= 1;
}
return c;
}
I was working on a recursive multiplication problem without the * operator and came up with a solution that was informed by the top answer here.
I thought it was worth posting because I really like the explanation in the top answer here, but wanted to expand on it in a way that:
Had a function representation.
Handled cases where your "remainder" was arbitrary.
This only handles positive integers, but you could wrap it in a check for negatives like some of the other answers.
def rec_mult_bitwise(a,b):
# Base cases for recursion
if b == 0:
return 0
if b == 1:
return a
# Get the most significant bit and the power of two it represents
msb = 1
pwr_of_2 = 0
while True:
next_msb = msb << 1
if next_msb > b:
break
pwr_of_2 += 1
msb = next_msb
if next_msb == b:
break
# To understand the return value, remember:
# 1: Left shifting by the power of two is the same as multiplying by the number itself (ie x*16=x<<4)
# 2: Once we've done that, we still need to multiply by the remainder, hence b - msb
return (a << pwr_of_2) + rec_mult_bitwise(a, b - msb)
Using bitwise operators reduces the time complexity: the loop below runs once per bit of the smaller operand rather than adding repeatedly.
In cpp:
#include<iostream>
using namespace std;
int main(){
int a, b, small, big, res = 0; // read the elements
cin>>a>>b;
// find the small number to reduce the iterations
small = (a<b)?a:b; // smaller number, using the ternary operator
big = (small^a)?a:b; // bigger number, using the bitwise XOR operator
while(small > 0)
{
if(small & 1)
{
res += big;
}
big = big << 1; // double big (multiply by 2 via left shift)
small = small >> 1; // halve small (divide by 2 via right shift)
}
cout<<res;
}
In Python:
a = int(input())
b = int(input())
res = 0
small = a if(a < b) else b
big = a if(small ^ a) else b
def multiplication(small, big):
res = 0
while small > 0:
if small & 1:
res += big
big = big << 1
small = small >> 1
return res
answer = multiplication(small, big)
print(answer)
def multiply(x, y):
return x << (y >> 1)
You would want to halve the value of y, hence shift the bits of y to the right once (y >> 1), and then shift x to the left by that amount to get your answer: x << (y >> 1).
I have an array of uint64 and for all unset bits (0s), I do some evaluations.
The evaluations are not terribly expensive, but very few bits are unset. Profiling says that I spend a lot of time in the finding-the-next-unset-bit logic.
Is there a faster way (on a Core2duo)?
My current code can skip lots of high 1s:
for(int y=0; y<height; y++) {
uint64_t xbits = ~board[y];
int x = 0;
while(xbits) {
if(xbits & 1) {
... with x and y
}
x++;
xbits >>= 1;
}
}
(And any discussion about how/if to SIMD/CUDA-ise this would be an intriguing tangent!)
Hacker's Delight suggests a loop-unrolled binary search. Not pretty, but fast for sparse unset bits because it skips dwords/bytes/nibbles/etc. with every bit set.
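For reference, here is a sketch of that kind of unrolled binary search for a 64-bit word (this is the usual trailing-zero-count pattern applied to the inverted word, not a quote from the book):
#include <stdint.h>

/* Index of the lowest 0 bit of x, or 64 if every bit is set. */
int lowest_unset_bit(uint64_t x)
{
    x = ~x;                 /* now we are looking for the lowest 1 bit */
    if (x == 0) return 64;
    int n = 0;
    if ((x & 0xFFFFFFFFu) == 0) { n += 32; x >>= 32; }
    if ((x & 0xFFFFu) == 0)     { n += 16; x >>= 16; }
    if ((x & 0xFFu) == 0)       { n += 8;  x >>= 8;  }
    if ((x & 0xFu) == 0)        { n += 4;  x >>= 4;  }
    if ((x & 0x3u) == 0)        { n += 2;  x >>= 2;  }
    if ((x & 0x1u) == 0)        { n += 1; }
    return n;
}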
If you can get a Phenom with SSE4a (not Core2 Duo, unfortunately) you can use POPCNT to write a fast number-of-set-bits function. Then you can get the index of the next unset bit with:
pop(x & (~x-1))
x & (~x-1) clears the set bits above the next zero bit; then you just have to count the remaining bits with POPCNT.
Here's a worked example with a byte:
01101111 x
10010000 ~x
10001111 ~x-1
00001111 x & ~x-1
pop(00001111) => 4
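With GCC the same expression can be written with a builtin; a sketch (the builtin only becomes a single POPCNT instruction when compiling for a CPU that has it):
#include <stdint.h>

/* Index of the next unset bit: count the set bits below the lowest zero bit. */
int next_unset_index(uint64_t x)
{
    return __builtin_popcountll(x & (~x - 1));
}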
Have you considered a table, which would allow you to handle a whole byte at once? Essentially, with a single subscript operation you'd retrieve a list of the "x" values that are not set in the byte (to which you'd add 8 * byte-within-uint64 to get your true "x").
By using one byte to store one 1-to-8 bit-position value (we could pack this a bit, but then the benefit of having a ready-to-go value would be somewhat defeated), and by assuming a maximum of four 0-valued bits (byte values with more 0 bits can be coded with an escape code, which would trigger some conventional bit logic - acceptable given the low probability of such events), we need a table of 256 * 4 bytes = 1 KB.
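A sketch of how such a table could be built; this simplified version stores up to eight positions per byte plus a count, rather than the packed four-entry-plus-escape-code layout described above:
#include <stdint.h>

/* zeros_in_byte[v] lists the bit positions that are 0 in the byte value v,
   and zeros_count[v] says how many of those entries are valid. */
static uint8_t zeros_in_byte[256][8];
static uint8_t zeros_count[256];

static void init_zero_tables(void)
{
    for (int v = 0; v < 256; v++) {
        int n = 0;
        for (int bit = 0; bit < 8; bit++)
            if (!(v & (1 << bit)))
                zeros_in_byte[v][n++] = (uint8_t)bit;
        zeros_count[v] = (uint8_t)n;
    }
}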
If you're willing to use assembly, BSF (Bit Scan Forward) would be the operation to use. It finds 1 bits though, so you'll have to invert your bitmask. IIRC, the XOR will set the zero flag if the result is 0, so you can test that flag before trying BSF. On x86, BSF works on 32-bit registers, so you'll have to split your value. (But then you should be using 32-bit integers in the first place, I'd say.)
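If inline assembly is a step too far, the same idea can be expressed with a GCC builtin that compiles down to BSF/TZCNT; a sketch using the board and y from the question:
#include <stdint.h>

void scan_row(uint64_t boardrow, int y)
{
    uint64_t xbits = ~boardrow;          /* invert so the unset bits become 1s */
    while (xbits) {
        int x = __builtin_ctzll(xbits);  /* index of the lowest set bit */
        /* ... with x and y ... */
        xbits &= xbits - 1;              /* clear that bit and continue */
    }
}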
I can think of a few optimization points, like loop unrolling, where you can try something like:
for(int y=0; y < height; y++) {
uint64_t xbits = ~board[y];
int x = 0;
while(xbits) {
if(xbits & (1 << 0)) {
... with x and y
}
if(xbits & (1 << 1)) {
... with x and y
}
if(xbits & (1 << 2)) {
... with x and y
}
if(xbits & (1 << 3)) {
... with x and y
}
if(xbits & (1 << 4)) {
... with x and y
}
if(xbits & (1 << 5)) {
... with x and y
}
if(xbits & (1 << 6)) {
... with x and y
}
if(xbits & (1 << 7)) {
... with x and y
}
x+=8;
xbits >>= 8;
}
}
This removes 7 loop checks, 7 additions, and 7 shifts per 8 bits processed ...
Another way I can think of is to skip whole groups of bits at once when they are all set, for example:
while (xbits) {
if (xbits & 0xF) {
// Process for the four bits !!!
}
xbits >>= 4;
}
Warning: if the unset bits are very scattered, the above method might actually make things slower :-(
One approach - partition into nibbles then use switch to select the bits from the nibble. Use templates so the bit selected is known at compile time, and to help unwind the code.
template < int i, int x >
struct process_bit {
inline static void apply ( int y ) { };
};
template < int x >
struct process_bit < 1, x > {
inline static void apply ( int y ) {
evaluate ( x, y );
}
};
template < int x, int n >
inline void process_nibble_bits ( int y ) {
process_bit < x & 1, n >::apply( y );
process_bit < ( x >> 1 ) & 1, n + 1 > ::apply( y );
process_bit < ( x >> 2 ) & 1, n + 2 > ::apply( y );
process_bit < ( x >> 3 ) & 1, n + 3 > ::apply( y );
}
template < int n >
inline void process_nibble ( uint64_t xbits, int y ) {
uint64_t nibble = ( xbits >> n ) & 0xf;
if ( nibble ) {
switch ( nibble ) {
case 0:
process_nibble_bits < 0, n > ( y );
break;
case 1:
process_nibble_bits < 1, n > ( y );
break;
case 2:
process_nibble_bits < 2, n > ( y );
break;
case 3:
process_nibble_bits < 3, n > ( y );
break;
case 4:
process_nibble_bits < 4, n > ( y );
break;
case 5:
process_nibble_bits < 5, n > ( y );
break;
case 6:
process_nibble_bits < 6, n > ( y );
break;
case 7:
process_nibble_bits < 7, n > ( y );
break;
case 8:
process_nibble_bits < 8, n > ( y );
break;
case 9:
process_nibble_bits < 9, n > ( y );
break;
case 10:
process_nibble_bits < 10, n > ( y );
break;
case 11:
process_nibble_bits < 11, n > ( y );
break;
case 12:
process_nibble_bits < 12, n > ( y );
break;
case 13:
process_nibble_bits < 13, n > ( y );
break;
case 14:
process_nibble_bits < 14, n > ( y );
break;
case 15:
process_nibble_bits < 15, n > ( y );
break;
}
}
}
template < int i, int n >
struct bit_tree {
inline static void apply ( uint64_t xbits, int y ) {
// each call to here represents scan of bits in [ n, n + 2i )
bit_tree < i >> 1, n > ::apply(xbits, y);
bit_tree < i >> 1, n + i > ::apply(xbits, y);
};
};
template < int i, int n >
struct bit_tree_with_guard {
inline static void apply ( uint64_t xbits, int y ) {
// each call to here represents scan of bits in [ n, n + 2i )
// so this branch to execute if any in [ n, n + i ) are set
if ( xbits & ( ( ( ( ( uint64_t ) 1LL ) << i ) - 1 ) << n ) )
bit_tree < i >> 1, n > ::apply(xbits, y);
if ( xbits & ( ( ( ( ( uint64_t ) 1LL ) << i ) - 1 ) << ( n + i) ) )
bit_tree < i >> 1, n + i > ::apply(xbits, y);
};
};
// put guards on 8 and 16 bit blocks ( for some reason using inheritance is slower )
template < int n >
struct bit_tree < 8, n > {
inline static void apply ( uint64_t xbits, int y ) {
bit_tree_with_guard < 8, n > ::apply ( xbits, y );
}
};
template < int n >
struct bit_tree < 16, n > {
inline static void apply ( uint64_t xbits, int y ) {
bit_tree_with_guard < 16, n > ::apply ( xbits, y );
}
};
template < int n >
struct bit_tree < 2, n > {
inline static void apply ( uint64_t xbits, int y ) {
process_nibble < n > ( xbits, y );
}
};
void template_nibbles(int height) {
for (int y = 0; y < height; y++) {
uint64_t xbits = ~board[y];
bit_tree< 32, 0>::apply ( xbits, y );
}
}
When run, it's not as fast as the ffs version, but it is a touch better than the other portable ones, and it appears to be consistent with them in results:
$ bin\bit_twiddle_micro_opt.exe
testing will_while()... 3375000 usecs (check 1539404233,1539597930)
testing will_ffs()... 2890625 usecs (check 675191567,1001386403)
testing alphaneo_unrolled_8()... 3296875 usecs (check 1539404233,1539597930)
testing template_nibbles()... 3046875 usecs (check 1539404233,1539597930)
Using a tree all the way doesn't seem to give any gain; not using the switch for the nibble is slower. Anyone know a way of not having to write the 16 cases by hand using only C++?
Other answers are good. Here's my contribution:
You could invert the word, and then have a loop finding the least-significant 1-bit:
int x = something;
int lsb = x ^ ((x-1) & x);
i.e. if x = 100100
a = (x - 1) = 100011 // these two steps turn off the lsb
b = (a & x) = 100000
c = (x ^ b) = 000100 // this step detects the lsb
lsb = c
Then to tell if you are done, do x ^= lsb and test for zero.
If you wanted to turn that lsb (which is an actual bit) into a bit-number, that's where a lookup table or an unrolled binary search could be just what you need.
Is that what you wanted?
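Putting those steps together for one row of the board from the question (a sketch; the evaluation itself is left as a comment):
uint64_t x = ~board[y];                  /* invert: unset bits become 1s */
while (x) {
    uint64_t lsb = x ^ ((x - 1) & x);    /* isolate the least-significant 1 bit */
    /* ... turn lsb into a bit number (lookup table or unrolled binary search) and use it with y ... */
    x ^= lsb;                            /* clear it; the loop ends when x reaches zero */
}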
I would suggest using some sort of lookup table (per byte, or short, depending on the available resources) that would tell you which bits are clear in a certain value.
Here's a quick micro-benchmark; please run it if you can to get stats for your system, and please add your own algorithms!
The commandline:
g++ -o bit_twiddle_micro_opt bit_twiddle_micro_opt.cpp -O9 -fomit-frame-pointer -DNDEBUG -march=native
And the code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <stdint.h>
static unsigned long get_usecs() {
struct timeval tv;
gettimeofday(&tv,NULL);
return tv.tv_sec*1000000+tv.tv_usec;
}
enum { MAX_HEIGHT = 64 };
uint64_t board[MAX_HEIGHT];
int xsum, ysum;
void evaluate(int x,int y) {
xsum += x;
ysum += y;
}
void alphaneo_unrolled_8(int height) {
for(int y=0; y < height; y++) {
uint64_t xbits = ~board[y];
int x = 0;
while(xbits) {
if(xbits & (1 << 0))
evaluate(x,y);
if(xbits & (1 << 1))
evaluate(x+1,y);
if(xbits & (1 << 2))
evaluate(x+2,y);
if(xbits & (1 << 3))
evaluate(x+3,y);
if(xbits & (1 << 4))
evaluate(x+4,y);
if(xbits & (1 << 5))
evaluate(x+5,y);
if(xbits & (1 << 6))
evaluate(x+6,y);
if(xbits & (1 << 7))
evaluate(x+7,y);
x+=8;
xbits >>= 8;
}
}
}
void will_while(int height) {
for(int y=0; y<height; y++) {
uint64_t xbits = ~board[y];
int x = 0;
while(xbits) {
if(xbits & 1)
evaluate(x,y);
xbits >>= 1;
x++;
}
}
}
void will_ffs(int height) {
for(int y=0; y<height; y++) {
uint64_t xbits = ~board[y];
int x = __builtin_ffsl(xbits);
while(x) {
evaluate(x-1,y);
xbits >>= x;
xbits <<= x;
x = __builtin_ffsl(xbits);
}
}
}
void rnd_board(int dim) {
for(int y=0; y<dim; y++) {
board[y] = (dim < 64) ? ~(((uint64_t)1 << dim)-1) : 0; // avoid shifting by 64, which is undefined
for(int x=0; x<dim; x++)
if(random() & 1)
board[y] |= (uint64_t)1 << x;
}
}
void test(const char* name,void(*func)(int)) {
srandom(0);
printf("testing %s... ",name);
xsum = ysum = 0;
const unsigned long start = get_usecs();
for(int i=0; i<100000; i++) {
const int dim = (random() % MAX_HEIGHT) + 1;
rnd_board(dim);
func(dim);
}
const unsigned long stop = get_usecs();
printf("%lu usecs (check %d,%d)\n",stop-start,xsum,ysum);
}
int main() {
test("will_while()",will_while);
test("will_ffs()",will_ffs);
test("alphaneo_unrolled_8()",alphaneo_unrolled_8);
return 0;
}
Does your profiling indicate you're mostly spending time in the inner while loop, or are you spending most of the time doing the ~board[y] calculation and then incrementing y straight away?
If it's the latter, you might be better off having a second level bitmap, with each bit in that map eliminating an entire 64b word in your board bitmap -- that way you can skip a fair bit further ahead, and if you're lucky, avoid loading entire cache lines of your bitmap.
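A sketch of what that could look like, assuming height <= 64 so a single summary word suffices, and assuming a word full_rows (a made-up name) is kept up to date whenever the board changes, with bit y set when board[y] has every bit set:
for (int y = 0; y < height; y++) {
    if (full_rows & ((uint64_t)1 << y))
        continue;                        /* the whole word is set: skip without touching board[y] */
    uint64_t xbits = ~board[y];
    /* ... existing inner loop over xbits ... */
}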
What's the distribution of the number of bits set in your bitmap?
If you have very few unset bits, then don't use a bitfield at all, use a sparse representation. By that, I mean keep an array of integers containing the index of each unset bit. Iterating through the unset bits is just iterating through the array. Setting and clearing bits becomes more complicated, but if finding an unset bit is your most expensive operation, using a sparse representation will likely be a win.
If you think that unset bits will be uncommon, then perhaps a simple
if (xbits != ((uint64_t)-1))
{
// regular code goes here
}
would be a win. That way, in the common case (all bits in the word are set) you would skip past 64 set bits in one go.
A variation of the lookup table version:
Have a lookup table giving the next unset bit for an 8-bit value. Check 8-bit blocks: AND with 0xFF and compare to see if the result is still 0xFF. If it is, skip; otherwise look it up in the table.
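A sketch of that variation (table and function names are made up; 8 is used to mean "no unset bit in this byte"):
#include <stdint.h>

static uint8_t next_unset_in_byte[256];   /* index of the lowest 0 bit, or 8 for 0xFF */

static void init_next_unset_table(void)
{
    for (int v = 0; v < 256; v++) {
        int i = 0;
        while (i < 8 && (v & (1 << i)))
            i++;
        next_unset_in_byte[v] = (uint8_t)i;
    }
}

/* Scan a word byte by byte, skipping the blocks that are all 1s. */
static int first_unset_in_word(uint64_t w)
{
    for (int byte = 0; byte < 8; byte++) {
        unsigned v = (unsigned)((w >> (8 * byte)) & 0xFF);
        if (v != 0xFF)
            return 8 * byte + next_unset_in_byte[v];
    }
    return 64;   /* every bit is set */
}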