Fastest way to convert unsigned char 8 bits to actual numbers - C++

I am using an unsigned char to store 8 flags. Each flag represents a corner of a cube, so 00000001 is corner 1, 01000100 is corners 3 and 7, etc. My current solution is to & the value with 1, 2, 4, 8, 16, 32, 64 and 128, check whether the result is non-zero, and store the corner. That is, if (result & 1) corners.push_back(1);. Any chance I can get rid of that 'if' statement? I was hoping I could do it with bitwise operators, but I could not think of any.
A little background on why I want to get rid of the if statement. This cube is actually a voxel which is part of a grid that is at least 512x512x512 in size. That is more than 134 million voxels. I am performing calculations on each one of the voxels (well, not exactly, but I won't go into too much detail as it is irrelevant here) and that is a lot of calculations. And I need to perform these calculations per frame. Even a speed boost that is minuscule per function call adds up over this many calculations. To give you an idea: at some point my algorithm needed to determine whether a float was negative, positive or zero (within some error). I had if statements and greater/smaller-than checks in there; replacing them with a fast float-to-int function shaved off a quarter of a second. Currently, each frame in a 128x128x128 grid takes a little more than 4 seconds.

I would consider a different approach to it entirely: there are only 256 possibilities for different combinations of flags. Precalculate 256 vectors and index into them as needed.
std::vector<std::vector<int> > corners(256);
for (int i = 0; i < 256; ++i) {
    std::vector<int>& v = corners[i];
    if (i & 1)   v.push_back(1);
    if (i & 2)   v.push_back(2);
    if (i & 4)   v.push_back(4);
    if (i & 8)   v.push_back(8);
    if (i & 16)  v.push_back(16);
    if (i & 32)  v.push_back(32);
    if (i & 64)  v.push_back(64);
    if (i & 128) v.push_back(128);
}
for (int i = 0; i < NumVoxels(); ++i) {
    unsigned char flags = GetFlags(i);
    const std::vector<int>& v = corners[flags];
    ... // do whatever with v
}
This avoids all the per-voxel conditionals, and also avoids having push_back call new, which I suspect would be more expensive anyway.
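If the vector-of-vectors indirection itself shows up in profiles, the same precalculation can be stored in flat, fixed-size arrays instead, which keeps the whole table in a few contiguous kilobytes. This is a sketch of my own on top of the answer, not part of it; here the table maps to corner indices 1..8 as in the question, rather than to the bit values used above:

unsigned char corner_list[256][8];  // corner_list[f][0 .. corner_count[f]-1]
unsigned char corner_count[256];

void BuildCornerTable() {
    for (int i = 0; i < 256; ++i) {
        unsigned char n = 0;
        for (int b = 0; b < 8; ++b)
            if (i & (1 << b))
                corner_list[i][n++] = (unsigned char)(b + 1); // corners numbered 1..8
        corner_count[i] = n;
    }
}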

If there's some operation that needs to be done if the bit is set and not if it's not, it seems you'll have to have a conditional of some kind somewhere. If it could be expressed as a calculation somehow, you could get around it like this, for example:
numCorners = ((result >> 0) & 1) + ((result >> 1) & 1) + ((result >> 2) & 1) + ...
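As an aside (mine, not the answerer's): that sum is exactly a population count, so where a compiler builtin is available you can write it directly; C++20 also offers std::popcount in <bit>:

int numCorners = __builtin_popcount(result); // GCC/Clang builtin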

Hacker's Delight, first page:
x & (-x) // isolates the lowest set bit
x & (x - 1) // clears the lowest set bit
Inlining your push_back method would also help (better create a function that receives all the flags together).
Usually if you need performance, you should design the whole system with that in mind. Maybe if you post more code it will be easier to help.
EDIT: here is a nice idea:
unsigned char LOG2_LUT[256] = {...};
int t;
switch (count_set_bits(flags)) {
    case 8: t = flags;
            flags &= (flags - 1); // clear the lowest set bit
            t ^= flags;           // isolate the bit that changed
            corners.push_back(LOG2_LUT[t]);
    case 7: t = flags;
            flags &= (flags - 1);
            t ^= flags;
            corners.push_back(LOG2_LUT[t]);
    case 6: t = flags;
            flags &= (flags - 1);
            t ^= flags;
            corners.push_back(LOG2_LUT[t]);
    // etc. - each case deliberately falls through to the next
};
count_set_bits() is a well-known function: http://www-graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
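The same two identities also give a compact loop that visits only the set bits, with no unrolled switch (my condensed variant of the idea above, assuming the same LOG2_LUT):

while (flags) {
    unsigned char t = flags & (-flags); // isolate the lowest set bit
    corners.push_back(LOG2_LUT[t]);     // map the power of two to its corner
    flags &= (flags - 1);               // clear the lowest set bit
}

This still branches on the loop condition, but only once per set bit rather than once per possible bit.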

There is a way, it's not "pretty", but it works.
(result & 1)   && (corners.push_back(1), true);
(result & 2)   && (corners.push_back(2), true);
(result & 4)   && (corners.push_back(3), true);
(result & 8)   && (corners.push_back(4), true);
(result & 16)  && (corners.push_back(5), true);
(result & 32)  && (corners.push_back(6), true);
(result & 64)  && (corners.push_back(7), true);
(result & 128) && (corners.push_back(8), true);
It uses a little-known feature of the C++ language: short-circuit evaluation. (The comma-operator wrapper is needed because std::vector's push_back returns void, which cannot be an operand of &&.)

I've seen a similar algorithm in the OpenTTD code. It turned out to be utterly useless: you're better off not breaking down numbers like that. Instead, replace the iteration over the vector<> you have now with an iteration over the bits of the byte. This is far more cache-friendly.
I.e.
#include <climits> // for UCHAR_MAX

unsigned char flags = Foo(); // the value you didn't put in a vector<>
for (unsigned char c = (UCHAR_MAX >> 1) + 1; c != 0; c >>= 1)
{
    if (flags & c)
        Bar(flags & c);
}

Related

Fastest Way to XOR all bits from value based on bitmask?

I've got an interesting problem that has me looking for a more efficient way of doing things.
Let's say we have a value (in binary)
(VALUE) 10110001
(MASK) 00110010
----------------
(AND) 00110000
Now, I need to be able to XOR any bits from the (AND) value that are set in the (MASK) value (always lowest to highest bit):
(RESULT) AND1(0) xor AND4(1) xor AND5(1) = 0
Now, on paper, this is certainly quick since I can see which bits are set in the mask. It seems to me that programmatically I would need to keep right shifting the MASK until I found a set bit, XOR it with a separate value, and loop until the entire byte is complete.
Can anyone think of a faster way? I'm looking for the way to do this with the least number of operations and stored values.
If I understood this question correctly, what you want is to get every bit from VALUE that is set in the MASK, and compute the XOR of those bits.
First of all, note that XOR'ing a value with 0 will not change the result. So, to ignore some bits, we can treat them as zeros.
So, XORing the bits set in VALUE that are in MASK is equivalent to XORing the bits in VALUE&MASK.
Now note that the result is 0 if the number of set bits is even, 1 if it is odd.
That means we want to count the number of set bits. Some architectures/compilers have ways to quickly compute this value. For instance, on GCC this can be obtained with __builtin_popcount.
So on GCC, this can be computed with:
int set_bits = __builtin_popcount(value & mask);
return set_bits % 2;
If you want the code to be portable, then this won't do. However, a comment in this answer suggests that some compilers can inline std::bitset::count to efficiently obtain the same result.
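For illustration, a portable sketch along those lines (my code, not the answerer's; whether std::bitset::count is inlined efficiently depends on the implementation):

#include <bitset>

int masked_xor(unsigned value, unsigned mask) {
    // XOR of the masked bits == parity of their popcount
    return std::bitset<32>(value & mask).count() % 2;
}

For the example above, masked_xor(0xB1, 0x32) returns 0, matching the worked result.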
If I'm understanding you right, you have
result = value & mask
and you want to XOR the 1 bits of mask & result together. The XOR of a series of bits is the same as counting the number of bits and checking if that count is even or odd. If it's odd, the XOR would be 1; if even, XOR would give 0.
count_bits(mask & result) % 2 != 0
mask & result can be simplified to simply result. You don't need to AND it with mask again. The % 2 != 0 can be alternately written as & 1.
count_bits(result) & 1
As far as how to count bits, the Bit Twiddling Hacks web page gives a number of bit counting algorithms.
Counting bits set, Brian Kernighan's way
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
    v &= v - 1; // clear the least significant bit set
}
Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.
If you were to use that implementation, you could optimize it a bit further. If you think about it, you don't need the full count of bits. You only need to track their parity. Instead of counting bits you could just flip c each iteration.
unsigned bit_parity(unsigned v) {
    unsigned c;
    for (c = 0; v; c ^= 1) {
        v &= v - 1;
    }
    return c;
}
(Thanks to Slava for the suggestion.)
Using the fact that XOR with 0 doesn't change anything, it's OK to apply the mask and then unconditionally XOR all bits together, which can be done in a parallel-prefix way. So something like this (not tested):
x = m & v;
x ^= x >> 16;
x ^= x >> 8;
x ^= x >> 4;
x ^= x >> 2;
x ^= x >> 1;
result = x & 1;
You can use more (or fewer) steps as needed, this is for 32 bits.
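On GCC and Clang this entire fold is also available as a single builtin (an aside on top of the answer; __builtin_parity is a compiler extension, not standard C++):

result = __builtin_parity(m & v); // 1 if m & v has an odd number of set bits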
One significant issue to be aware of if you use v &= v - 1 in the main body of your code: it changes the value of v to 0 in the course of the count. (With other methods, the variable instead ends up holding the number of 1's.) Counting logic is generally wrapped in a function, where this is no longer a concern, but if you are required to write the counting logic inline and v is needed again, you must preserve a copy of it.
In addition to the other two methods presented, the following is another favorite from bit-twiddling hacks that generally has a bit better performance than the loop method for larger numbers:
/* get the population of 1's in the binary representation of a number */
unsigned getn1s (unsigned int v)
{
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    v = (v + (v >> 4)) & 0x0F0F0F0F;
    v = v + (v << 8);
    v = v + (v << 16);
    return v >> 24;
}

Test zero for 4 bytes in an int

I come here to ask for tricks. I've got a 32-bit integer (that's 4 bytes). I want to test zero for each byte, and return true if one of them is true.
E.g.
int c1 = 0x01020304;
cout << test(c1) << endl; // output false
int c2 = 0x00010203;
cout << test(c2) << endl; // output true
int c3 = 0xfffefc00;
cout << test(c3) << endl; // output true
Are there any tricks to do it in the least number of CPU cycles?
There are several ways on the famous Bit Twiddling Hacks page:
bool hasZeroByte(unsigned int v)
{
    return ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}
or
bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
if (hasZeroByte) // or may just have 0x80 in the high byte
{
    hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}
And the likely most compact way when compiling to assembly
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
As they're tricks, they're hard to understand, so if you want clarity, mask out each byte and check as in dasblinkenlight's answer.
Example assembly output on Compiler Explorer
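As a sanity check, here is a minimal sketch (mine) exercising the compact macro on the question's three sample values:

#include <iostream>

#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)

int main() {
    std::cout << !!haszero(0x01020304u) << '\n'; // 0: no zero byte
    std::cout << !!haszero(0x00010203u) << '\n'; // 1: the high byte is zero
    std::cout << !!haszero(0xfffefc00u) << '\n'; // 1: the low byte is zero
    return 0;
}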
You can test it by masking each of the bytes in an & operation, and comparing the result to zero:
bool hasZeroByte(int32_t n) {
    return !(n & 0x000000FF)
        || !(n & 0x0000FF00)
        || !(n & 0x00FF0000)
        || !(n & 0xFF000000);
}
The fastest way to do this is probably to use strnlen, since most compilers will have optimized this to use low level instructions for finding zero bytes in strings.
bool hasZeroByte(int32_t n) {
    return strnlen(reinterpret_cast<char *>(&n), 4) < 4;
}
If you want to be a little more explicit, you could use the memchr function which is documented to do exactly what you are asking:
bool hasZeroByte(int32_t n) {
    return memchr(reinterpret_cast<void *>(&n), 0, 4) != nullptr;
}
For those who don't believe this answer, feel free to take a look at the glibc implementation of strlen and see that it is already doing all of the mentioned bit twiddling tricks in the other answers.
See also:
http://www.strchr.com/optimized_strlen_function
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
http://www.int80h.org/strlen/

how to optimize C++/C code for a large number of integers

I have written the code below. It checks the first bit of every byte: if the first bit of a byte is equal to 0, it concatenates that byte's value with the previous bytes and stores the result in a variable var1. Here data points to the bytes of an integer. An integer in my implementation is uint64_t and can occupy up to 8 bytes.
uint64_t func(char* data)
{
    uint64_t var1 = 0;
    int i = 0;
    while ((data[i] >> 7) == 0)
    {
        var1 = (var1 << 7) | (data[i]);
        i++;
    }
    return var1;
}
I am repeatedly calling func(), trillions of times for trillions of integers, and therefore it runs slow. Is there a way to optimize this code?
EDIT: Thanks to Joe Z.; it's indeed a form of uleb128 unpacking.
I have only tested this minimally; I am happy to fix glitches with it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads by conditional branches. That leads me to the following code:
// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and
// ... bit 7 indicating "more data in next byte"
uint64_t unpack( const uint8_t *const data )
{
    uint64_t value = ((data[0] & 0x7F   ) <<  0)
                   | ((data[1] & 0x7F   ) <<  7)
                   | ((data[2] & 0x7F   ) << 14)
                   | ((data[3] & 0x7F   ) << 21)
                   | ((data[4] & 0x7Full) << 28)
                   | ((data[5] & 0x7Full) << 35)
                   | ((data[6] & 0x7Full) << 42)
                   | ((data[7] & 0x7Full) << 49)
                   | ((data[8] & 0x7Full) << 56)
                   | ((data[9] & 0x7Full) << 63);

    if      ((data[0] & 0x80) == 0) value &= 0x000000000000007Full;
    else if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull;
    else if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull;
    else if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull;
    else if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull;
    else if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull;
    else if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull;
    else if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull;
    else if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;

    return value;
}
The basic idea is that small values are common (and so most of the if-statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.
Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
There are SIMD approaches, but none of them lend themselves readily to 7-bit data.
Also, if you can mark this inline in a header, then that may also help. It all depends on how many places this gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.
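One practical note of my own on top of this answer: since unpack() unconditionally reads data[0] through data[9], the caller must guarantee those bytes are readable. A simple way is to copy the encoded stream into a padded buffer once, up front (hypothetical helper, names mine):

#include <cstdint>
#include <cstring>
#include <vector>

// Copy the encoded stream into a buffer with 10 bytes of zero padding,
// so unpack() can always read its full 10 bytes without faulting.
std::vector<uint8_t> pad_for_unpack(const uint8_t* encoded, size_t size) {
    std::vector<uint8_t> buf(size + 10, 0);
    std::memcpy(buf.data(), encoded, size);
    return buf;
}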
Your code is problematic
uint64_t func(const unsigned char* pos)
{
    uint64_t var1 = 0;
    int i = 0;
    while ((pos[i] >> 7) == 0)
    {
        var1 = (var1 << 7) | (pos[i]);
        i++;
    }
    return var1;
}
First, a minor thing: i should be unsigned.
Second: you don't check that you stay within the boundary of pos. E.g. if all values of your pos array have a most significant bit of zero, you will reach pos[size], where size is the size of the array, and hence invoke undefined behaviour. You should pass the size of your array to the function and check that i is smaller than this size.
Third: if pos[i] has a most significant bit equal to zero for i = 0, ..., k with k > 10, then earlier work gets discarded (as you push the old bits out of var1).
The third point actually helps us:
uint64_t func(const unsigned char* pos, size_t size)
{
    size_t i(0);
    while ( i < size && (pos[i] >> 7) == 0 )
    {
        ++i;
    }
    // At this point, i is either equal to size or
    // i is the index of the first pos value you don't want to use.
    // Therefore we want to use the values
    //     pos[i-10], pos[i-9], ..., pos[i-1]
    // If i is less than 10, we obviously need to ignore some of the values.
    const size_t start = (i >= 10) ? (i - 10) : 0;

    uint64_t var1 = 0;
    for ( size_t j(start); j < i; ++j )
    {
        var1 <<= 7;
        var1 += pos[j];
    }
    return var1;
}
In conclusion: we separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have: if lots of entries are discarded, then you save a lot of writes to var1 with this approach.
Another thing: usually, if one function is called massively, the best optimization you can do is to call it less. Perhaps you can come up with an additional condition that makes the call to this function unnecessary.
Keep in mind that if you actually use 10 bytes, the first one ends up being truncated: 64 bits means that only 9 bytes can contribute their full 7 bits of information, leaving exactly one bit for the tenth. You might want to switch to uint128_t.
A small optimization would be:
while ((pos[i] & 0x80) == 0)
Bitwise and is generally faster than a shift. This of course depends on the platform, and it's also possible that the compiler will do this optimization itself.
Can you change the encoding?
Google came across the same problem, and Jeff Dean describes a really cool solution on slide 55 of his presentation:
http://research.google.com/people/jeff/WSDM09-keynote.pdf
http://videolectures.net/wsdm09_dean_cblirs/
The basic idea is that reading the first bit of several bytes is poorly supported on modern architectures. Instead, let's take 8 of these bits, and pack them as a single byte preceding the data. We then use the prefix byte to index into a 256-item lookup table, which holds masks describing how to extract numbers from the rest of the data.
I believe it's how protocol buffers are currently encoded.
Can you change your encoding? As you've discovered, using a bit on each byte to indicate if there's another byte following really sucks for processing efficiency.
A better way to do it is to model UTF-8, which encodes the length of the full int into the first byte:
0xxxxxxx // one byte with 7 bits of data
110xxxxx 10xxxxxx // two bytes with 11 bits of data
1110xxxx 10xxxxxx 10xxxxxx // three bytes with 16 bits of data
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // four bytes with 21 bits of data
// etc.
But UTF-8 has special properties to make it easier to distinguish from ASCII. This bloats the data and you don't care about ASCII, so you'd modify it to look like this:
0xxxxxxx // one byte with 7 bits of data
10xxxxxx xxxxxxxx // two bytes with 14 bits of data.
110xxxxx xxxxxxxx xxxxxxxx // three bytes with 21 bits of data
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // four bytes with 28 bits of data
// etc.
This has the same compression level as your method (up to 64 bits = 9 bytes), but is significantly easier for a CPU to process.
From this you can build a lookup table for the first byte which gives you a mask and length:
// byte_counts[255] contains the number of additional
// bytes if the first byte has a value of 255.
uint8_t const byte_counts[256]; // a global constant.
// byte_masks[255] contains a mask for the useful bits in
// the first byte, if the first byte has a value of 255.
uint8_t const byte_masks[256]; // a global constant.
And then to decode:
// the resulting value.
uint64_t v = 0;
// mask off the data bits in the first byte.
v = *data & byte_masks[*data];
// read in the rest.
switch(byte_counts[*data])
{
case 3: v = v << 8 | *++data;
case 2: v = v << 8 | *++data;
case 1: v = v << 8 | *++data;
case 0: return v;
default:
// If you're on VC++, this'll make it take one less branch.
// Better make sure you've got all the valid inputs covered, though!
__assume(0);
}
No matter the size of the integer, this hits only one branch point: the switch, which will likely be put into a jump table. You can potentially optimize it even further for ILP by not letting each case fall through.
First, rather than shifting, you can do a bitwise test on the relevant bit. Second, you can use a pointer rather than indexing (but the compiler should do this optimization itself). Thus:
uint64_t
readUnsignedVarLength( unsigned char const* pos )
{
    uint64_t results = 0;
    while ( (*pos & 0x80) == 0 ) {
        results = (results << 7) | *pos;
        ++ pos;
    }
    return results;
}
At least, this corresponds to what your code does. For variable length encoding of unsigned integers, it is incorrect, since 1) variable length encodings are little endian, and your code is big endian, and 2) your code doesn't OR in the high order byte. Finally, the Wiki page suggests that you've got the test inverted. (I know this format mainly from BER encoding and Google protocol buffers, both of which set bit 7 to indicate that another byte will follow.)
The routine I use is:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    int shift = 0;
    uint64_t results = 0;
    uint8_t tmp = *source ++;
    while ( ( tmp & 0x80 ) != 0 ) {
        results |= uint64_t( tmp & 0x7F ) << shift;
        shift += 7;
        tmp = *source ++;
    }
    return results | ( uint64_t( tmp ) << shift );
}
For the rest, this wasn't written with performance in mind, but I doubt that you could do significantly better. An alternative solution would be to pick up all of the bytes first, then process them in reverse order:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    unsigned char buffer[10];
    unsigned char* p = std::begin( buffer );
    while ( p != std::end( buffer ) && (*source & 0x80) != 0 ) {
        *p = *source & 0x7F;
        ++ p;
        ++ source;
    }
    assert( p != std::end( buffer ) );
    *p = *source;
    ++ p;
    uint64_t results = 0;
    while ( p != std::begin( buffer ) ) {
        -- p;
        results = (results << 7) + *p;
    }
    return results;
}
The necessity of checking for buffer overrun will likely make this slightly slower, but on some architectures, shifting by a constant is significantly faster than shifting by a variable, so this could be faster on them.
Globally, however, don't expect miracles. The motivation for using variable length integers is to reduce data size, at a cost in runtime for decoding and encoding.

Constrain a 16 bit signed value between 0 and 4095 using Bit Manipulation only (without branching)

I want to constrain the value of a signed short variable between 0 and 4095, after which I take the most significant 8 bits as my final value for use elsewhere. Right now I'm doing it in a basic manner as below:
short color = /* some external source */;
/*
 * I get the color value as a 16 bit signed integer from an
 * external source I cannot trust. 16 bits are being used here
 * for higher precision.
 */
if ( color < 0 ) {
    color = 0;
}
else if ( color > 4095 ) {
    color = 4095;
}
unsigned char color8bit = 0xFF & (color >> 4);
/*
 * color8bit is my final value which I would actually use
 * in my application.
 */
Is there any way this can be done using bit manipulation only, i.e. without using any conditionals? It might help quite a bit in speeding things up as this operation is happening thousands of time in the code.
The following won't help as it doesn't take care of edge cases such as negative values and overflows:
unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 );
Edit: Adam Rosenfield's answer takes the correct approach, but is incorrectly implemented (its min step uses 4096 where it should use 4095). ouah's answer gives correct results, but takes a different approach from what I originally intended to find out.
This is what I ended up using:
const static short min = 0;
const static short max = 4095;
color = min ^ (( min ^ color ) & -( min < color ));
color = max ^ (( color ^ max ) & -( color < max ));
unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 );
Yes, see these bit-twiddling hacks:
short color = ...;
color = color ^ (color & -(color < 0)); // color = max(color, 0)
color = 4096 ^ ((color ^ 4096) & -(color < 4096)); // color = min(color, 4096)
unsigned char color8bit = 0xFF & (color >> 4);
Whether this actually turns out to be faster, I don't know -- you should profile. Most modern x86 and x86-64 chips these days support "conditional move" instructions (cmov) which conditionally store a value depending on the EFLAGS status bits, and optimizing compilers will often produce these instructions from ternary expressions like color >= 0 ? color : 0. Those will likely be fastest, but they won't run on older x86 chips.
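For comparison, the straightforward version that a modern compiler will typically turn into those cmov instructions could look like this (my sketch; std::clamp requires C++17):

#include <algorithm>

unsigned char to8bit(short color)
{
    color = std::clamp(color, short(0), short(4095)); // usually two cmovs, no branches
    return 0xFF & (color >> 4);
}

Profiling both against the XOR trick is the only way to know which wins on your target.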
You can do the following:
BYTE data[0x10000] = { ..... };
BYTE byte_color = data[(unsigned short)short_color];
These days a 64KB table is not something outrageous and may be acceptable. The number of assembler instructions in this variant of the code will be the absolute minimum compared to other possible approaches.
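To make that concrete, here is one way such a table could be filled in (my sketch, with BYTE taken to be unsigned char; reinterpreting the index as a signed 16-bit value assumes two's complement):

typedef unsigned char BYTE;
BYTE data[0x10000];

void BuildTable()
{
    for (unsigned i = 0; i < 0x10000; ++i) {
        short color = (short)i; // the index reinterpreted as the signed input
        if (color < 0)
            color = 0;
        else if (color > 4095)
            color = 4095;
        data[i] = (BYTE)(0xFF & (color >> 4));
    }
}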
short color = /* ... */;
color = ((((!!(color >> 12)) * 0xFFF)) | (!(color >> 12) * color))
      & (!(color >> 15) * 0xFFF);
unsigned char color8bit = 0xFF & (color >> 4);
It assumes two's complement representation.
This has the advantage of not using any equality or relational operators. There are situations where you want to avoid branches at all costs: in some security applications you don't want attackers to exploit branch prediction. Without branches (on embedded processors particularly) you can make your function run in constant time for all inputs.
Note that x * 0xFFF can be further reduced to (x << 12) - x. Also, the multiplication in (!(color >> 12) * color) can be optimized further, as its left operand is always 0 or 1.
EDIT:
I add a little explanation: the expression above simply does the same as below without the use of the conditional and relational operators:
y = ((y > 4095 ? 4095 : 0) | (y > 4095 ? 0 : y))
& (y < 0 ? 0 : 4095);
EDIT2:
as @HotLicks correctly noted in his comment, the ! is still a conceptual branch. Nevertheless it can be computed with bitwise operators. For example !!a can be done with the trivial:
b = (a >> 15 | a >> 14 | ... | a >> 1 | a) & 1
and !a can be done as b ^ 1. And I'm sure there is a nice hack to do it more effectively.
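Indeed there is (my addition, not part of ouah's answer): for a 16-bit two's complement value, a | -a has its sign bit set exactly when a is nonzero, so !!a collapses to

b = ((unsigned short)(a | -a)) >> 15; // 1 if a != 0, 0 if a == 0

and b ^ 1 then gives !a as before.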
I assume a short is 16 bits.
Remove negative values:
int16_t mask = -(int16_t)((uint16_t)color >> 15); // 0xFFFF if -ve, 0 if +ve
short value = color & ~mask;                      // 0 if -ve, color if +ve
value is now between 0 and 32767 inclusive.
You can then do something similar to clamp the value:
mask = (uint16_t)(value - 4096) >> 15; // 1 if <= 4095, 0 if > 4095
--mask;                                // 0 if <= 4095, 0xFFFF if > 4095
mask &= 0xFFF;                         // 0 if <= 4095, 4095 if > 4095
value |= mask; // low 12 bits all set if > 4095, value unchanged if <= 4095
               // (the final 0x0FFF mask in your last line clears the high bits)
You could also easily vectorize this using Intel's SSE intrinsics. One 128-bit register would hold 8 of your short and there are functions to min/max/shift/mask all of them in parallel. In a loop the constants for min/max can be preloaded into a register. The pshufb instruction (part of SSSE3) will even pack the bytes for you.
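A sketch of what that vectorized version could look like with SSE2 intrinsics (my illustration, not part of the original answer; it processes 8 values per iteration, and the load/store strategy would depend on your data layout). Note that SSE2's pack instruction suffices here, so SSSE3's pshufb is not strictly needed:

#include <emmintrin.h> // SSE2
#include <cstdint>

// Clamp 8 signed 16-bit values to [0, 4095] and keep the top 8 bits of each.
void clamp8(const int16_t* in, uint8_t* out)
{
    const __m128i lo = _mm_setzero_si128();
    const __m128i hi = _mm_set1_epi16(4095);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    v = _mm_max_epi16(v, lo);   // clamp below at 0
    v = _mm_min_epi16(v, hi);   // clamp above at 4095
    v = _mm_srli_epi16(v, 4);   // >> 4: keep the most significant 8 bits
    v = _mm_packus_epi16(v, v); // pack the 16-bit lanes down to bytes
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), v);
}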
I'm going to leave an answer even though it doesn't directly answer the original question, because in the end I think you'll find it much more useful.
I'm assuming that your color is coming from a camera or image scanner running at 12 bits, followed by some undetermined processing step that might create values beyond the 0 to 4095 range. If that's the case the values are almost certainly derived in a linear fashion. The problem is that displays are gamma corrected, so the conversion from 12 bit to 8 bit will require a non-linear gamma function rather than a simple right shift. This will be much slower than the clamping operation your question is trying to optimize. If you don't use a gamma function the image will appear too dark.
short color = /* some external source */;
unsigned char color8bit;
if (color <= 0)
    color8bit = 0;
else if (color >= 4095)
    color8bit = 255;
else
    color8bit = (unsigned char)(255.99 * pow(color / 4095.0, 1/2.2));
At this point you might consider a lookup table as suggested by Kirill Kobelev.
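Since the clamped input is at most 12 bits, the pow call can be hoisted into a one-time 4096-entry table, along the lines Kirill Kobelev suggests (my sketch of that combination):

#include <cmath>

unsigned char gamma_lut[4096];

void BuildGammaLut()
{
    for (int i = 0; i < 4096; ++i)
        gamma_lut[i] = (unsigned char)(255.99 * std::pow(i / 4095.0, 1 / 2.2));
}

// per pixel, after clamping color to [0, 4095]:
// color8bit = gamma_lut[color];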
This is somewhat akin to Tom Seddon's answer, but uses a slightly cleaner way to do the clamp above. Note that both Mr. Seddon's answer and mine avoid the issue in ouah's answer that shifting a signed value to the right is implementation-defined behavior, and hence not guaranteed to work on all architectures.
#include <inttypes.h>
#include <iostream>

int16_t clamp(int16_t value)
{
    // clampBelow is 0xffff for -ve, 0x0000 for +ve
    int16_t const clampBelow = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

    // value is now clamped below at zero
    value &= ~clampBelow;

    // subtract 4095 so we can do the same trick again
    value -= 4095;

    // clampAbove is 0xffff for -ve, 0x0000 for +ve,
    // i.e. 0xffff for original value < 4095, 0x0000 for original >= 4095
    int16_t const clampAbove = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

    // adjusted value now clamped above at zero
    value &= clampAbove;

    // and restore to original value.
    value += 4095;
    return value;
}

void verify(int16_t value)
{
    int16_t const clamped = clamp(value);
    int16_t const check = (value < 0 ? 0 : value > 4095 ? 4095 : value);
    if (clamped != check)
    {
        std::cout << "Verification failure for value: " << value
                  << ", clamped: " << clamped
                  << ", check: " << check << std::endl;
    }
}

int main()
{
    for (int16_t i = 0x4000; i != 0x3fff; i++)
    {
        verify(i);
    }
    return 0;
}
That's a full test program (OK, so it doesn't test 0x3fff - sue me. ;) ) from which you can extract the clamp() routine for whatever you need.
I've also broken clamp out to "one step per line" for the sake of clarity. If your compiler has a half way decent optimizer, you can leave it as is and rely on the compiler to produce the best possible code. If your compiler's optimizer is not that great, then by all means, it can be reduced in line count, albeit at the cost of a little readability.
"Never sacrifice clarity for efficiency" -- Bob Buckley, comp sci professor, U-Warwick, Coventry, England, 1980.
Best piece of advice I ever got. ;)

How to replace this if/else statement with bitwise operations?

I am doing a bitwise & between two bit arrays, saving the result in old_array, and I want to get rid of the if/else statement. I should probably make use of the BIT_STATE macro, but how?
#define BYTE_POS(pos) (pos / CHAR_BIT)
#define BIT_POS(pos) (1 << (CHAR_BIT - 1 - (pos % CHAR_BIT)))
#define BIT_STATE(pos, state) (state << (CHAR_BIT - 1 - (pos % CHAR_BIT)))
if (((old_array[BYTE_POS(old_pos)] & BIT_POS(old_pos)) != 0) &&
    ((new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) != 0))
{
    old_array[BYTE_POS(old_pos)] |= BIT_POS(old_pos);
}
else
{
    old_array[BYTE_POS(old_pos)] &= ~(BIT_POS(old_pos));
}
You can always calculate both results and then combine them. The biggest problem is computing a fitting bitmask.
E.g.
const uint32_t a = 41,
               b = 8;
const uint32_t mask[2] = { 0, 0xffffffff };
const uint32_t result = (a & mask[condition])
                      | (b & mask[!condition]);
or to avoid the unary not
const uint32_t mask_a[2] = { 0, 0xffffffff },
               mask_b[2] = { mask_a[1], mask_a[0] };
const uint32_t result = (a & mask_a[condition])
                      | (b & mask_b[condition]);
However: when doing bitwise manipulation, always be careful about the number of bits involved. One way to be careful is to use fixed-size types like uint32_t, which may or may not be defined on your platform (but if not, the good thing is you get a compile error), or to use templates carefully. Other types, including char, int and even bool, can have any size beyond some defined minimum.
Yes, such code looks somewhat ugly.
I don't think BIT_STATE is useful here. (state MUST be 0 or 1 for it to work as expected.)
I see the following approaches to get rid of it:
a) Use C++ bit fields, see for example http://en.wikipedia.org/wiki/Bit_field
b) "Hide" that code in a class/method/function
c) I think this is equivalent to your code:
if ((new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) == 0)
{
    old_array[BYTE_POS(old_pos)] &= ~(BIT_POS(old_pos));
}
or as a one-liner:
old_array[BYTE_POS(old_pos)] &=
    ~((new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) ? 0 : BIT_POS(old_pos));
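The Bit Twiddling Hacks page also has a direct branch-free idiom for "conditionally set or clear a bit", w ^= (-f ^ w) & m, which sets bit m of w when f is 1 and clears it when f is 0. Applied to this case (my adaptation, not part of the answer above; relational tests form f, but there is no branch):

unsigned char f = ((old_array[BYTE_POS(old_pos)] & BIT_POS(old_pos)) != 0)
                & ((new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) != 0); // 0 or 1
old_array[BYTE_POS(old_pos)] ^=
    (-f ^ old_array[BYTE_POS(old_pos)]) & BIT_POS(old_pos);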
Take the expression
(new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos))
which is either 0 or has a 1 in bit BIT_POS(new_pos), and shift it until that bit, if set, lands in BIT_POS(old_pos):
(new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) << ( old_pos - new_pos )
Now AND the result into old_array[BYTE_POS(old_pos)], ORing in a mask that leaves the other bits of that byte intact:
old_array[BYTE_POS(old_pos)] &= ~BIT_POS(old_pos)
    | ((new_array[BYTE_POS(new_pos)] & BIT_POS(new_pos)) << ( old_pos - new_pos ));
The only trick is that it is undefined behaviour to shift by a negative amount, so if you already know whether old_pos is greater or less than new_pos, you can substitute >> ( new_pos - old_pos ) when appropriate.
I've not tried this out. I may have << and >> swapped.