Trouble understanding a piece of code: bitwise operations in C

I have the following segment of code and am having trouble deciphering what it does.
/* assume 0 <= n <= 3 and 0 <= m <= 3 */
int n8 = n << 3;
int m8 = m << 3;
int n_mask = 0xff << n8;
int m_mask = 0xff << m8; // left-shifts 255 by the value of m8
int n_byte = ((x & n_mask) >> n8) & 0xff;
int m_byte = ((x & m_mask) >> m8) & 0xff;
int bytes_mask = n_mask | m_mask;
int leftover = x & ~bytes_mask;
return (leftover | (n_byte << m8) | (m_byte << n8));

It swaps the nth and mth bytes.
The start has two parallel computations, one sequence with n and one sequence with m, that select the nth and mth byte like this:
Step 1: 0xff << n8
0x000000ff << 0 = 0x000000ff
.. 8 = 0x0000ff00
.. 16 = 0x00ff0000
.. 24 = 0xff000000
Step 2: x & n_mask
x = 0xDDCCBBAA
x & 0x000000ff = 0x000000AA
x & 0x0000ff00 = 0x0000BB00
x & 0x00ff0000 = 0x00CC0000
x & 0xff000000 = 0xDD000000
Step 3: ((x & n_mask) >> n8) & 0xff (note: & 0xff is required because the right shift is likely to be an arithmetic right shift, it would not be required if the code worked with unsigned integers)
n = 0: 0x000000AA
1: 0x000000BB
2: 0x000000CC
3: 0x000000DD
So it extracts the nth byte and puts it at the bottom of the integer.
The same thing is done for m.
leftover is the other (2 or 3) bytes, the ones not extracted by the previous process. There may be 3 bytes left over, because n and m can be the same.
Finally, the last step puts it all back together, but with the byte extracted from the nth position shifted to the mth position and the mth byte shifted to the nth position, so they switch places.


Optimize generating a parent bitmask from child bitmasks

Given a 64 bit child mask input, for example:
10000000 01000000 00100000 00010000 00001000 00000100 00000010 00000000
The 8 bit parent mask would be:
11111110
A single bit in the parent mask maps to 8 bits in the child mask, and the bit in the parent mask is set to 1 when at least one of the 8 child bits is set to 1. A simple algorithm to calculate this would be the following:
unsigned __int64 childMask = 0x8040201008040200; // The number above in hex
unsigned __int8 parentMask = 0;
for (int i = 0; i < 8; i++)
{
    const unsigned __int8 child = childMask >> (8 * i);
    parentMask |= (child > 0) << i;
}
I'm wondering if there are any optimizations left to do in the code above. The code will run on CUDA, where I'd like to avoid branches whenever possible. For an answer, code in C++/C will do fine. The for loop can be unrolled, but I'd rather leave that to the compiler, giving hints where necessary, for example with #pragma unroll.
A possible approach is to use __vcmpgtu4 to do the per-byte comparisons. It returns the result as packed masks, which can be AND-ed with 0x08040201 (0x80402010 for the high half) to turn them into the bits of the final result. The bits then need to be summed horizontally, which does not seem to be well supported, but it can be done with plain old C-style code.
For example,
unsigned int low = childMask;
unsigned int high = childMask >> 32;
unsigned int lowmask = __vcmpgtu4(low, 0) & 0x08040201;
unsigned int highmask = __vcmpgtu4(high, 0) & 0x80402010;
unsigned int mask = lowmask | highmask;
mask |= mask >> 16;
mask |= mask >> 8;
parentMask = mask & 0xff;
This solution based on classical bit-twiddling techniques may be faster than the accepted answer on at least some GPU architectures supported by CUDA, since __vcmp* intrinsics are not fast on all of them.
Since GPUs are basically 32-bit architectures, the 64-bit childMask is processed as two halves, hi and lo.
The processing consists of three steps. In the first step, we set each non-null byte to 0x80 and leave the byte unmodified otherwise. In other words, we set the most significant bit of each byte if the byte is non-zero. One method is to use a modified version of a null-byte detection algorithm Alan Mycroft devised in the 1980s and which is often used for C-string processing. Alternatively we can use the fact that hadd (~0, x) has the most significant bit set only if x != 0, where hadd is a halving add: hadd (a, b) = (a + b) / 2, without overflow in the intermediate computation. An efficient implementation was published by Peter L. Montgomery in 2000.
In the second step, we collect the most significant bits of each byte into the highest nibble. For this, we need to move bit 7 to bit 28, bit 15 to bit 29, bit 23 to bit 30, and bit 31 to bit 31, corresponding to shift factors of 21, 14, 7, and 0. To avoid separate shifts, we combine the shift factors into a single "magic" multiplier, then multiply with that, thus performing all shifts in parallel.
In the third step we combine the nibbles containing the result and move them into the correct bit position. For the hi word, that means moving the nibble in bits <31:28> into bits <7:4> and for the lo word this means moving the nibble in bits <31:28> into bits <3:0>. This combination can be performed either with bit-wise OR or addition. Which variant is faster may depend on the target architecture.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define USE_HAROLDS_SOLUTION (0)
#define USE_MYCROFT_ZEROBYTE (0)
#define USE_TWO_MASKS (1)
#define USE_ADD_COMBINATION (1)
uint8_t parentMask (uint64_t childMask)
{
#if USE_TWO_MASKS
    const uint32_t LSB_MASK = 0x01010101;
#endif // USE_TWO_MASKS
    const uint32_t MSB_MASK = 0x80808080;
    const uint32_t MAGICMUL = (1 << 21) | (1 << 14) | (1 << 7) | (1 << 0);
    uint32_t lo, hi;
    /* split 64-bit argument into two halves for 32-bit GPU architecture */
    lo = (uint32_t)(childMask >> 0);
    hi = (uint32_t)(childMask >> 32);
#if USE_MYCROFT_ZEROBYTE
    /* Set most significant bit in each byte that is not zero. Adapted from Alan
       Mycroft's null-byte detection algorithm (newsgroup comp.lang.c, 1987/04/08,
       https://groups.google.com/forum/#!original/comp.lang.c/2HtQXvg7iKc/xOJeipH6KLMJ):
       null_byte(x) = ((x - 0x01010101) & (~x & 0x80808080))
    */
#if USE_TWO_MASKS
    lo = (((lo | MSB_MASK) - LSB_MASK) | lo) & MSB_MASK;
    hi = (((hi | MSB_MASK) - LSB_MASK) | hi) & MSB_MASK;
#else // USE_TWO_MASKS
    lo = (((lo & ~MSB_MASK) + ~MSB_MASK) | lo) & MSB_MASK;
    hi = (((hi & ~MSB_MASK) + ~MSB_MASK) | hi) & MSB_MASK;
#endif // USE_TWO_MASKS
#else // USE_MYCROFT_ZEROBYTE
    /* Set most significant bit in each byte that is not zero. Use hadd(~0,x).
       Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
       https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
       (A+B)/2 = (A AND B) + (A XOR B)/2.
    */
#if USE_TWO_MASKS
    lo = (((~lo & ~LSB_MASK) >> 1) + lo) & MSB_MASK;
    hi = (((~hi & ~LSB_MASK) >> 1) + hi) & MSB_MASK;
#else // USE_TWO_MASKS
    lo = (((~lo >> 1) & ~MSB_MASK) + lo) & MSB_MASK;
    hi = (((~hi >> 1) & ~MSB_MASK) + hi) & MSB_MASK;
#endif // USE_TWO_MASKS
#endif // USE_MYCROFT_ZEROBYTE
    /* collect most significant bit of each byte in most significant nibble */
    lo = lo * MAGICMUL;
    hi = hi * MAGICMUL;
    /* combine nibbles with results for high and low half into final result */
#if USE_ADD_COMBINATION
    return (uint8_t)((hi >> 24) + (lo >> 28));
#else // USE_ADD_COMBINATION
    return (uint8_t)((hi >> 24) | (lo >> 28));
#endif // USE_ADD_COMBINATION
}
uint8_t parentMask_ref (uint64_t childMask)
{
    uint8_t parentMask = 0;
    for (uint32_t i = 0; i < 8; i++) {
        uint8_t child = childMask >> (8 * i);
        parentMask |= (child > 0) << i;
    }
    return parentMask;
}
uint32_t build_mask (uint32_t a)
{
    return ((a & 0x80808080) >> 7) * 0xff;
}
uint32_t vcmpgtu4 (uint32_t a, uint32_t b)
{
    uint32_t r;
    r = ((a & ~b) + (((a ^ ~b) >> 1) & 0x7f7f7f7f));
    r = build_mask (r);
    return r;
}
uint8_t parentMask_harold (uint64_t childMask)
{
    uint32_t low = childMask;
    uint32_t high = childMask >> 32;
    uint32_t lowmask = vcmpgtu4 (low, 0) & 0x08040201;
    uint32_t highmask = vcmpgtu4 (high, 0) & 0x80402010;
    uint32_t mask = lowmask | highmask;
    mask |= mask >> 16;
    mask |= mask >> 8;
    return (uint8_t)mask;
}
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
    uint64_t childMask, count = 0;
    uint8_t res, ref;
    do {
        childMask = KISS64;
        ref = parentMask_ref (childMask);
#if USE_HAROLDS_SOLUTION
        res = parentMask_harold (childMask);
#else // USE_HAROLDS_SOLUTION
        res = parentMask (childMask);
#endif // USE_HAROLDS_SOLUTION
        if (res != ref) {
            printf ("\narg=%016llx res=%02x ref=%02x\n", childMask, res, ref);
            return EXIT_FAILURE;
        }
        if (!(count & 0xffffff)) printf ("\r%llu", count);
        count++;
    } while (1);
    return EXIT_SUCCESS;
}

Convert every 5 bits into integer values in C++

Firstly, if anyone has a better title for me, let me know.
Here is an example of the process I am trying to automate with C++
I have an array of values that appear in this format:
9C07 9385 9BC7 00 9BC3 9BC7 9385
I need to convert them to binary and then convert every 5 bits to decimal, with the last bit acting as a flag:
I'll do this with only the first word here.
9C07
10011 | 10000 | 00011 | 1
19 | 16 | 3
These are actually x, y, z coordinates, and the final bit determines their order: a '0' would make it x=19 y=16 z=3, and a '1' makes it x=16 y=3 z=19.
I already have a buffer filled with these hex values, but I have no idea where to go from here.
I assume these are integer literals, not strings?
The way to do this is with bitwise right shift (>>) and bitwise AND (&):
#include <cstdint>
struct Coordinate {
    std::uint8_t x;
    std::uint8_t y;
    std::uint8_t z;
    constexpr Coordinate(std::uint16_t n) noexcept
    {
        if (n & 1) { // flag
            x = (n >> 6) & 0x1F; // 1 1111
            y = (n >> 1) & 0x1F;
            z = n >> 11;
        } else {
            x = n >> 11;
            y = (n >> 6) & 0x1F;
            z = (n >> 1) & 0x1F;
        }
    }
};
The following code would extract the three coordinates and the flag from the 16 least significant bits of value (i.e. its least significant word).
int flag = value & 1; // keep only the least significant bit
value >>= 1; // shift right by one bit
int third_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits
int second_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits
int first_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits (only useful if there are other words in "value")
What you need is most likely some loop doing this on each word of your array.

8-digit BCD check

I have an 8-digit BCD number and need to check whether it is a valid BCD number. How can I do this programmatically (C/C++)?
Ex: 0x12345678 is valid, but 0x00f00abc isn't.
Thanks in advance!
You need to check each 4-bit quantity to make sure it's less than 10. For efficiency you want to work on as many bits as you can at a single time.
Here I break the digits apart to leave a zero between each one, then add 6 to each and check for overflow.
uint32_t highs = (value & 0xf0f0f0f0) >> 4;
uint32_t lows = value & 0x0f0f0f0f;
bool invalid = (((highs + 0x06060606) | (lows + 0x06060606)) & 0xf0f0f0f0) != 0;
Edit: actually we can do slightly better. It doesn't take 4 bits to detect overflow, only 1. If we divide all the digits by 2, it frees a bit and we can check all the digits at once.
uint32_t halfdigits = (value >> 1) & 0x77777777;
bool invalid = ((halfdigits + 0x33333333) & 0x88888888) != 0;
The obvious way to do this is:
/* returns 1 if x is valid BCD */
int
isvalidbcd (uint32_t x)
{
    for (; x; x = x >> 4)
    {
        if ((x & 0xf) >= 0xa)
            return 0;
    }
    return 1;
}
This link tells you all about BCD, and recommends something like this as a more optimised solution (reworking to check all the digits, and hence using a 64 bit data type, and untested):
/* returns 1 if x is valid BCD */
int
isvalidbcd (uint32_t x)
{
    /* a nibble-boundary carry out of any digit means that digit was > 9;
       the mask must be applied before the logical negation */
    return !((((uint64_t)x + 0x66666666ULL) ^ (uint64_t)x) & 0x111111110ULL);
}
For a digit to be invalid, it needs to be 10-15. That in turn means the 8-bit must be set along with either the 4-bit or the 2-bit; the lowest bit doesn't matter at all.
So:
long mask8 = value & 0x88888888;
long mask4 = value & 0x44444444;
long mask2 = value & 0x22222222;
return ((mask8 >> 2) & ((mask4 >> 1) | mask2)) == 0;
Slightly less obvious:
long mask8 = (value >> 2);
long mask42 = (value | (value >> 1));
return (mask8 & mask42 & 0x22222222) == 0;
By shifting before masking, we don't need 3 different masks.
Inspired by #Mark Ransom
bool invalid = (0x88888888 & (((value & 0xEEEEEEEE) >> 1) + (0x66666666 >> 1))) != 0;
// or
bool valid = !((((value & 0xEEEEEEEEu) >> 1) + 0x33333333) & 0x88888888);
Mask off each BCD digit's 1's place, shift right, then add 6 and check for BCD digit overflow.
How this works:
By adding +6 to each digit, we look for an overflow (marked *) of the 4-bit sum.
abcd
+ 110
-----
*efgd
But the bit value of d does not contribute to the sum, so first mask off that bit and shift right. Now the overflow bit is in the 8's place. This all is done in parallel and we mask these carry bits with 0x88888888 and test if any are set.
0abc
+ 11
-----
*efg

How does this algorithm to count the number of set bits in a 32-bit integer work?

int SWAR(unsigned int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
I have seen this code that counts the number of bits equal to 1 in a 32-bit integer, and I noticed that its performance is better than __builtin_popcount, but I can't understand how it works.
Can someone give a detailed explanation of how this code works?
OK, let's go through the code line by line:
Line 1:
i = i - ((i >> 1) & 0x55555555);
First of all, the significance of the constant 0x55555555 is that, written using the Java / GCC style binary literal notation,
0x55555555 = 0b01010101010101010101010101010101
That is, all its odd-numbered bits (counting the lowest bit as bit 1 = odd) are 1, and all the even-numbered bits are 0.
The expression ((i >> 1) & 0x55555555) thus shifts the bits of i right by one, and then sets all the even-numbered bits to zero. (Equivalently, we could've first set all the odd-numbered bits of i to zero with & 0xAAAAAAAA and then shifted the result right by one bit.) For convenience, let's call this intermediate value j.
What happens when we subtract this j from the original i? Well, let's see what would happen if i had only two bits:
i j i - j
----------------------------------
0 = 0b00 0 = 0b00 0 = 0b00
1 = 0b01 0 = 0b00 1 = 0b01
2 = 0b10 1 = 0b01 1 = 0b01
3 = 0b11 1 = 0b01 2 = 0b10
Hey! We've managed to count the bits of our two-bit number!
OK, but what if i has more than two bits set? In fact, it's pretty easy to check that the lowest two bits of i - j will still be given by the table above, and so will the third and fourth bits, and the fifth and sixth bits, and so on. In particular:
despite the >> 1, the lowest two bits of i - j are not affected by the third or higher bits of i, since they'll be masked out of j by the & 0x55555555; and
since the lowest two bits of j can never have a greater numerical value than those of i, the subtraction will never borrow from the third bit of i: thus, the lowest two bits of i also cannot affect the third or higher bits of i - j.
In fact, by repeating the same argument, we can see that the calculation on this line, in effect, applies the table above to each of the 16 two-bit blocks in i in parallel. That is, after executing this line, the lowest two bits of the new value of i will now contain the number of bits set among the corresponding bits in the original value of i, and so will the next two bits, and so on.
Line 2:
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
Compared to the first line, this one's quite simple. First, note that
0x33333333 = 0b00110011001100110011001100110011
Thus, i & 0x33333333 takes the two-bit counts calculated above and throws away every second one of them, while (i >> 2) & 0x33333333 does the same after shifting i right by two bits. Then we add the results together.
Thus, in effect, what this line does is take the bitcounts of the lowest two and the second-lowest two bits of the original input, computed on the previous line, and add them together to give the bitcount of the lowest four bits of the input. And, again, it does this in parallel for all the 8 four-bit blocks (= hex digits) of the input.
Line 3:
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
OK, what's going on here?
Well, first of all, (i + (i >> 4)) & 0x0F0F0F0F does exactly the same as the previous line, except it adds the adjacent four-bit bitcounts together to give the bitcounts of each eight-bit block (i.e. byte) of the input. (Here, unlike on the previous line, we can get away with moving the & outside the addition, since we know that the eight-bit bitcount can never exceed 8, and therefore will fit inside four bits without overflowing.)
Now we have a 32-bit number consisting of four 8-bit bytes, each byte holding the number of 1-bit in that byte of the original input. (Let's call these bytes A, B, C and D.) So what happens when we multiply this value (let's call it k) by 0x01010101?
Well, since 0x01010101 = (1 << 24) + (1 << 16) + (1 << 8) + 1, we have:
k * 0x01010101 = (k << 24) + (k << 16) + (k << 8) + k
Thus, the highest byte of the result ends up being the sum of:
its original value, due to the k term, plus
the value of the next lower byte, due to the k << 8 term, plus
the value of the second lower byte, due to the k << 16 term, plus
the value of the fourth and lowest byte, due to the k << 24 term.
(In general, there could also be carries from lower bytes, but since we know the value of each byte is at most 8, we know the addition will never overflow and create a carry.)
That is, the highest byte of k * 0x01010101 ends up being the sum of the bitcounts of all the bytes of the input, i.e. the total bitcount of the 32-bit input number. The final >> 24 then simply shifts this value down from the highest byte to the lowest.
PS. This code could easily be extended to 64-bit integers by widening all the constants to 64 bits (0x5555555555555555 and so on), changing the 0x01010101 to 0x0101010101010101, and changing the >> 24 to >> 56. Indeed, the same method would even work for 128-bit integers; 256 bits would require adding one extra shift / add / mask step, however, since the number 256 no longer quite fits into an 8-bit byte.
I prefer this one, it's much easier to understand.
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
This is a comment on Ilmari's answer.
I put it as an answer because of format issues:
Line 1:
i = i - ((i >> 1) & 0x55555555); // (1)
This line is derived from this easier to understand line:
i = (i & 0x55555555) + ((i >> 1) & 0x55555555); // (2)
If we call
i = input value
j0 = i & 0x55555555
j1 = (i >> 1) & 0x55555555
k = output value
We can rewrite (1) and (2) to make the explanation clearer:
k = i - j1; // (3)
k = j0 + j1; // (4)
We want to demonstrate that (3) can be derived from (4).
i can be written as the addition of its even and odd bits (counting the lowest bit as bit 1 = odd):
i = iodd + ieven
  = (i & 0x55555555) + (i & 0xAAAAAAAA)
  = (i & modd) + (i & meven)
Since the meven mask clears the last bit of i,
the last equality can be written this way:
i = (i & modd) + (((i >> 1) & modd) << 1)
  = j0 + 2*j1
That is:
j0 = i - 2*j1 (5)
Finally, replacing (5) into (4) we achieve (3):
k = j0 + j1 = i - 2*j1 + j1 = i - j1
This is an explanation of yeer's answer:
int SWAR(unsigned int i) {
    i = (i & 0x55555555) + ((i >> 1) & 0x55555555);  // A
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // B
    i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f);  // C
    i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff);  // D
    i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff); // E
    return i;
}
Let's use Line A as the basis of my explanation.
i = (i & 0x55555555) + ((i >> 1) & 0x55555555)
Let's rename the above expression as follows:
i = (i & mask) + ((i >> 1) & mask)
= A1 + A2
First, think of i not as 32 bits, but rather as an array of 16 groups, 2 bits each. A1 is the count array of size 16, each group containing the count of 1s at the right-most bit of the corresponding group in i:
i = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask = 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
i & mask = 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x
Similarly, A2 is "counting" the left-most bit for each group in i. Note that I can rewrite A2 = (i >> 1) & mask as A2 = (i & mask2) >> 1:
i = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask2 = 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
(i & mask2) = y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0
(i & mask2) >> 1 = 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y
(Note that mask2 = 0xaaaaaaaa)
Thus, A1 + A2 adds the counts of the A1 array and A2 array, resulting in an array of 16 groups, each group now contains the count of bits in each group.
Moving onto Line B, we can rename the line as follows:
i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
= (i & mask) + ((i >> 2) & mask)
= B1 + B2
B1 + B2 follows the same "form" as A1 + A2 from before. Think of i no longer as 16 groups of 2 bits, but rather as 8 groups of 4 bits. So similar to before, B1 + B2 adds the counts of B1 and B2 together, where B1 is the counts of 1s in the right side of the group, and B2 is the counts of the left side of the group. B1 + B2 is thus the counts of bits in each group.
Lines C through E now become more easily understandable:
int SWAR(unsigned int i) {
    // A: 16 groups of 2 bits, each group contains number of 1s in that group.
    i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
    // B: 8 groups of 4 bits, each group contains number of 1s in that group.
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    // C: 4 groups of 8 bits, each group contains number of 1s in that group.
    i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f);
    // D: 2 groups of 16 bits, each group contains number of 1s in that group.
    i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff);
    // E: 1 group of 32 bits, containing the number of 1s in that group.
    i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff);
    return i;
}

Swapping bits at a given point between two bytes

Let's say I have these two numbers:
x = 0xB7
y = 0xD9
Their binary representations are:
x = 1011 0111
y = 1101 1001
Now I want to crossover (GA) at a given point, say from position 4 onwards.
The expected result should be:
x = 1011 1001
y = 1101 0111
Bitwise, how can I achieve this?
I would just use bitwise operators:
t = (x & 0x0f)
x = (x & 0xf0) | (y & 0x0f)
y = (y & 0xf0) | t
That would work for that specific case. In order to make it more adaptable, I'd put it in a function, something like (pseudo-code, with &, | and ! representing bitwise "and", "or", and "not" respectively):
def swapBits (x, y, s, e):
lookup = [255,127,63,31,15,7,3,1]
mask = lookup[s] & !lookup[e]
t = x & mask
x = (x & !mask) | (y & mask)
y = (y & !mask) | t
return (x,y)
The lookup values allow you to specify which bits to swap. Let's take the values xxxxxxxx for x and yyyyyyyy for y along with start bit s of 2 and end bit e of 6 (bit numbers start at zero on the left in this scenario):
x y s e t mask !mask execute
-------- -------- - - -------- -------- -------- -------
xxxxxxxx yyyyyyyy 2 6 starting point
00111111 mask = lookup[2](00111111)
00111100 & !lookup[6](11111100)
00xxxx00 t = x & mask
xx0000xx x = x & !mask(11000011)
xxyyyyxx | y & mask(00111100)
yy0000yy y = y & !mask(11000011)
yyxxxxyy | t(00xxxx00)
If a bit position is the same in both values, no change is needed in either. If it's opposite, they both need to invert.
XOR with 1 flips a bit; XOR with 0 is a no-op.
So what we want is a value that has a 1 everywhere there's a bit-difference between the inputs, and a 0 everywhere else. That's exactly what a XOR b does.
Simply mask this bit-difference to only keep the differences in the bits we want to swap, and we have a bit-swap in 3 XORs + 1 AND.
Your mask is (1UL << position) - 1. One less than a power of 2 has all the bits below that set. Or more generally with a high and low position for your bit-range: (1UL << highpos) - (1UL << lowpos). Whether a lookup-table is faster than bit-set / sub depends on the compiler and hardware. (See #PaxDiablo's answer for the LUT suggestion.)
// Portable C:
//static inline
void swapBits_char(unsigned char *A, unsigned char *B)
{
    const unsigned highpos = 4, lowpos = 0; // function args if you like
    const unsigned char mask = (1UL << highpos) - (1UL << lowpos);
    unsigned char tmpA = *A, tmpB = *B;     // read into locals in case A==B
    unsigned char bitdiff = tmpA ^ tmpB;
    bitdiff &= mask;                        // clear all but the selected bits
    *A = tmpA ^ bitdiff;                    // flip bits that differed
    *B = tmpB ^ bitdiff;
}
//static inline
void swapBit_uint(unsigned *A, unsigned *B, unsigned mask)
{
    unsigned tmpA = *A, tmpB = *B;
    unsigned bitdiff = tmpA ^ tmpB;
    bitdiff &= mask;                        // clear all but the selected bits
    *A = tmpA ^ bitdiff;
    *B = tmpB ^ bitdiff;
}
(Godbolt compiler explorer with gcc for x86-64 and ARM)
This is not an xor-swap. It does use temporary storage. As #chux's answer on a near-duplicate question demonstrates, a masked xor-swap requires 3 AND operations as well as 3 XOR. (And defeats the only benefit of XOR-swap by requiring a temporary register or other storage for the & results.) This answer is a modified copy of my answer on that other question.
This version only requires 1 AND. Also, the last two XORs are independent of each other, so total latency from inputs to both outputs is only 3 operations. (Typically 3 cycles).
For an x86 asm example of this, see this code-golf answer: Exchange capitalization of two strings in 14 bytes of x86-64 machine code (with commented asm source).
Swapping individual bits with XOR
unsigned int i, j; // positions of bit sequences to swap
unsigned int n; // number of consecutive bits in each sequence
unsigned int b; // bits to swap reside in b
unsigned int r; // bit-swapped result goes here
unsigned int x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1); // XOR temporary
r = b ^ ((x << i) | (x << j));