How does this algorithm to count the number of set bits in a 32-bit integer work? - c++

int SWAR(unsigned int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
I have seen this code, which counts the number of bits equal to 1 in a 32-bit integer, and I noticed that its performance is better than __builtin_popcount, but I can't understand how it works.
Can someone give a detailed explanation of how this code works?

OK, let's go through the code line by line:
Line 1:
i = i - ((i >> 1) & 0x55555555);
First of all, the significance of the constant 0x55555555 is that, written using the Java / GCC style binary literal notation,
0x55555555 = 0b01010101010101010101010101010101
That is, all its odd-numbered bits (counting the lowest bit as bit 1 = odd) are 1, and all the even-numbered bits are 0.
The expression ((i >> 1) & 0x55555555) thus shifts the bits of i right by one, and then sets all the even-numbered bits to zero. (Equivalently, we could've first set all the odd-numbered bits of i to zero with & 0xAAAAAAAA and then shifted the result right by one bit.) For convenience, let's call this intermediate value j.
What happens when we subtract this j from the original i? Well, let's see what would happen if i had only two bits:
i          j          i - j
----------------------------------
0 = 0b00   0 = 0b00   0 = 0b00
1 = 0b01   0 = 0b00   1 = 0b01
2 = 0b10   1 = 0b01   1 = 0b01
3 = 0b11   1 = 0b01   2 = 0b10
Hey! We've managed to count the bits of our two-bit number!
OK, but what if i has more than two bits? In fact, it's pretty easy to check that the lowest two bits of i - j will still be given by the table above, and so will the third and fourth bits, and the fifth and sixth bits, and so on. In particular:
despite the >> 1, the lowest two bits of i - j are not affected by the third or higher bits of i, since they'll be masked out of j by the & 0x55555555; and
since the lowest two bits of j can never have a greater numerical value than those of i, the subtraction will never borrow from the third bit of i: thus, the lowest two bits of i also cannot affect the third or higher bits of i - j.
In fact, by repeating the same argument, we can see that the calculation on this line, in effect, applies the table above to each of the 16 two-bit blocks in i in parallel. That is, after executing this line, the lowest two bits of the new value of i will now contain the number of bits set among the corresponding bits in the original value of i, and so will the next two bits, and so on.
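To make the "table applied to each two-bit block" claim concrete, here is a minimal C++ sketch that checks it field by field. The helper names pair_counts, field2 and pair_counts_ok are made up for this illustration and are not part of the original code.

```cpp
#include <cstdint>

// First SWAR step: each 2-bit field of the result holds the number of
// set bits in the same 2-bit field of the input.
std::uint32_t pair_counts(std::uint32_t i) {
    return i - ((i >> 1) & 0x55555555u);
}

// Hypothetical helper: reads 2-bit field number `field` (0 = lowest).
std::uint32_t field2(std::uint32_t v, unsigned field) {
    return (v >> (2 * field)) & 0x3u;
}

// Compares every 2-bit field of pair_counts(i) with a direct count of
// the two corresponding input bits.
bool pair_counts_ok(std::uint32_t i) {
    const std::uint32_t counts = pair_counts(i);
    for (unsigned f = 0; f < 16; ++f) {
        const std::uint32_t bits = field2(i, f);            // original two bits
        const std::uint32_t expected = (bits & 1u) + (bits >> 1);
        if (field2(counts, f) != expected) {
            return false;
        }
    }
    return true;
}
```

For example, pair_counts(0x6) is 0x5: the input fields 01 and 10 each contain one set bit, so both output fields are 01.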
Line 2:
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
Compared to the first line, this one's quite simple. First, note that
0x33333333 = 0b00110011001100110011001100110011
Thus, i & 0x33333333 takes the two-bit counts calculated above and throws away every second one of them, while (i >> 2) & 0x33333333 does the same after shifting i right by two bits. Then we add the results together.
Thus, in effect, what this line does is take the bitcounts of the lowest two and the second-lowest two bits of the original input, computed on the previous line, and add them together to give the bitcount of the lowest four bits of the input. And, again, it does this in parallel for all the 8 four-bit blocks (= hex digits) of the input.
Line 3:
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
OK, what's going on here?
Well, first of all, (i + (i >> 4)) & 0x0F0F0F0F does exactly the same as the previous line, except it adds the adjacent four-bit bitcounts together to give the bitcounts of each eight-bit block (i.e. byte) of the input. (Here, unlike on the previous line, we can get away with moving the & outside the addition, since we know that the eight-bit bitcount can never exceed 8, and therefore will fit inside four bits without overflowing.)
Now we have a 32-bit number consisting of four 8-bit bytes, each byte holding the number of 1-bits in that byte of the original input. (Let's call these bytes A, B, C and D.) So what happens when we multiply this value (let's call it k) by 0x01010101?
Well, since 0x01010101 = (1 << 24) + (1 << 16) + (1 << 8) + 1, we have:
k * 0x01010101 = (k << 24) + (k << 16) + (k << 8) + k
Thus, the highest byte of the result ends up being the sum of:
its original value, due to the k term, plus
the value of the next lower byte, due to the k << 8 term, plus
the value of the third byte, due to the k << 16 term, plus
the value of the fourth and lowest byte, due to the k << 24 term.
(In general, there could also be carries from lower bytes, but since we know the value of each byte is at most 8, we know the addition will never overflow and create a carry.)
That is, the highest byte of k * 0x01010101 ends up being the sum of the bitcounts of all the bytes of the input, i.e. the total bitcount of the 32-bit input number. The final >> 24 then simply shifts this value down from the highest byte to the lowest.
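A small sketch of the multiplication step on its own, assuming unsigned int is 32 bits wide so that the product wraps modulo 2^32. The function names byte_sum and byte_sum_mul are hypothetical.

```cpp
#include <cstdint>

// Sums the four bytes of k the long way.
std::uint32_t byte_sum(std::uint32_t k) {
    return (k & 0xFFu) + ((k >> 8) & 0xFFu) + ((k >> 16) & 0xFFu) + (k >> 24);
}

// Same sum via the multiply trick. This only matches byte_sum while
// every partial sum fits in one byte, which holds here because each
// byte of k is at most 8.
std::uint32_t byte_sum_mul(std::uint32_t k) {
    return (k * 0x01010101u) >> 24;
}
```

With k = 0x01020304 both functions return 10, and with k = 0x08080808 (the largest value the previous steps can produce) both return 32.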
P.S. This code could easily be extended to 64-bit integers by widening the mask constants to 64 bits (0x5555555555555555 and so on), changing the 0x01010101 to 0x0101010101010101, and changing the >> 24 to >> 56. Indeed, the same method would even work for 128-bit integers; 256 bits would require adding one extra shift / add / mask step, however, since the number 256 no longer quite fits into an 8-bit byte.
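As a rough sketch of that 64-bit extension (the name popcount64 is my own; note that the three mask constants have to be widened along with the multiplier and the final shift):

```cpp
#include <cstdint>

// 64-bit variant of the same three steps. Every constant is the
// 32-bit constant repeated twice.
int popcount64(std::uint64_t i) {
    i = i - ((i >> 1) & 0x5555555555555555ULL);                            // 2-bit counts
    i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL);  // 4-bit counts
    i = (i + (i >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit counts
    return static_cast<int>((i * 0x0101010101010101ULL) >> 56);            // sum of 8 bytes
}
```

The largest possible result, 64, still fits comfortably in the top byte, so the multiplication cannot carry out of it.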

I prefer this one, it's much easier to understand.
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);

This is a comment to Ilamari's answer.
I put it as an answer because of format issues:
Line 1:
i = i - ((i >> 1) & 0x55555555); // (1)
This line is derived from this easier to understand line:
i = (i & 0x55555555) + ((i >> 1) & 0x55555555); // (2)
If we call
i = input value
j0 = i & 0x55555555
j1 = (i >> 1) & 0x55555555
k = output value
We can rewrite (1) and (2) to make the explanation clearer:
k = i - j1; // (3)
k = j0 + j1; // (4)
We want to demonstrate that (3) can be derived from (4).
i can be written as the sum of its odd and even bits (counting the lowest bit as bit 1 = odd):
i = i_odd + i_even =
= (i & 0x55555555) + (i & 0xAAAAAAAA) =
= (i & m_odd) + (i & m_even)
Since the m_even mask clears the lowest bit of i,
the last equality can be written this way:
i = (i & m_odd) + (((i >> 1) & m_odd) << 1) =
= j0 + 2*j1
That is:
j0 = i - 2*j1 (5)
Finally, substituting (5) into (4) we obtain (3):
k = j0 + j1 = i - 2*j1 + j1 = i - j1
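The derivation can also be spot-checked numerically. This is only a sketch: step_sub and step_add stand for forms (3) and (4), and split_identity_holds tests the intermediate identity i = j0 + 2*j1. All three names are invented here.

```cpp
#include <cstdint>

// Form (3): k = i - j1
std::uint32_t step_sub(std::uint32_t i) {
    return i - ((i >> 1) & 0x55555555u);
}

// Form (4): k = j0 + j1
std::uint32_t step_add(std::uint32_t i) {
    return (i & 0x55555555u) + ((i >> 1) & 0x55555555u);
}

// Intermediate identity used in the derivation: i = j0 + 2*j1
bool split_identity_holds(std::uint32_t i) {
    const std::uint32_t j0 = i & 0x55555555u;
    const std::uint32_t j1 = (i >> 1) & 0x55555555u;
    return i == j0 + 2u * j1;
}
```

Both forms return the same value for any input, for example 0xAAAAAAAA for i = 0xFFFFFFFF.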

This is an explanation of yeer's answer:
int SWAR(unsigned int i) {
i = (i & 0x55555555) + ((i >> 1) & 0x55555555); // A
i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // B
i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f); // C
i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff); // D
i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff); // E
return i;
}
Let's use Line A as the basis of my explanation.
i = (i & 0x55555555) + ((i >> 1) & 0x55555555)
Let's rename the above expression as follows:
i = (i & mask) + ((i >> 1) & mask)
= A1 + A2
First, think of i not as 32 bits, but rather as an array of 16 groups, 2 bits each. A1 is the count array of size 16, each group containing the count of 1s at the right-most bit of the corresponding group in i:
i = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask = 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
i & mask = 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x
Similarly, A2 is "counting" the left-most bit for each group in i. Note that I can rewrite A2 = (i >> 1) & mask as A2 = (i & mask2) >> 1:
i = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask2 = 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
(i & mask2) = y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0
(i & mask2) >> 1 = 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y
(Note that mask2 = 0xaaaaaaaa)
Thus, A1 + A2 adds the counts of the A1 array and the A2 array, resulting in an array of 16 groups, where each group now contains the number of set bits in the corresponding 2-bit group of the original i.
Moving onto Line B, we can rename the line as follows:
i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
= (i & mask) + ((i >> 2) & mask)
= B1 + B2
B1 + B2 follows the same "form" as A1 + A2 from before. Think of i no longer as 16 groups of 2 bits, but rather as 8 groups of 4 bits. So similar to before, B1 + B2 adds the counts of B1 and B2 together, where B1 is the counts of 1s in the right side of the group, and B2 is the counts of the left side of the group. B1 + B2 is thus the counts of bits in each group.
Lines C through E now become more easily understandable:
int SWAR(unsigned int i) {
// A: 16 groups of 2 bits, each group contains number of 1s in that group.
i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
// B: 8 groups of 4 bits, each group contains number of 1s in that group.
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
// C: 4 groups of 8 bits, each group contains number of 1s in that group.
i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f);
// D: 2 groups of 16 bits, each group contains number of 1s in that group.
i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff);
// E: 1 group of 32 bits, containing the number of 1s in that group.
i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff);
return i;
}
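One way to convince yourself that the five steps really produce the bit count is to compare them against a naive loop. A minimal sketch follows; naive_popcount and swar_five_step are names chosen here, and the loop clears the lowest set bit on each iteration.

```cpp
#include <cstdint>

// Reference implementation: one iteration per set bit.
int naive_popcount(std::uint32_t v) {
    int count = 0;
    while (v != 0) {
        v &= v - 1;  // clears the lowest set bit
        ++count;
    }
    return count;
}

// The five mask-and-add steps A through E from the answer above.
int swar_five_step(std::uint32_t i) {
    i = (i & 0x55555555u) + ((i >> 1) & 0x55555555u);   // A: 16 x 2-bit counts
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);   // B: 8 x 4-bit counts
    i = (i & 0x0f0f0f0fu) + ((i >> 4) & 0x0f0f0f0fu);   // C: 4 x 8-bit counts
    i = (i & 0x00ff00ffu) + ((i >> 8) & 0x00ff00ffu);   // D: 2 x 16-bit counts
    i = (i & 0x0000ffffu) + ((i >> 16) & 0x0000ffffu);  // E: 1 x 32-bit count
    return static_cast<int>(i);
}
```

For 0xFFFFFFFF the intermediate values are 0xAAAAAAAA, 0x44444444, 0x08080808, 0x00100010 and finally 0x20, that is, 32.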

Related

Counting bits of ones in Byte by time Complexity O(1) C++ code

I searched for an algorithm that counts the number of ones in a byte in O(1) time,
and this is what I found on Google:
// C++ implementation of the approach
#include <bits/stdc++.h>
using namespace std;
int BitsSetTable256[256];
// Function to initialise the lookup table
void initialize()
{
// To initially generate the
// table algorithmically
BitsSetTable256[0] = 0;
for (int i = 0; i < 256; i++)
{
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
}
}
// Function to return the count
// of set bits in n
int countSetBits(int n)
{
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
}
// Driver code
int main()
{
// Initialise the lookup table
initialize();
int n = 9;
cout << countSetBits(n);
}
I understand that I need an array of size 256 (in other words, the size of the lookup table) for indexing from 0 to 255, which are all the decimal values that a byte can represent.
But in the function initialize I didn't understand the expression inside the for loop:
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
Why am I doing that? I didn't understand the purpose of this line inside the for loop.
In addition, in the function countSetBits, this function returns:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
I didn't understand at all what the bitwise AND with 0xff does, or why a right shift is needed.
Could anyone please explain the concept to me? I also didn't understand why, in function countSetBits, BitsSetTable256[n >> 24] isn't ANDed with 0xff.
I understand why I need the lookup table with size 2^8, but I didn't understand the other lines of code that I mentioned above. Could anyone please explain them to me in simple words? And what's the purpose of counting the number of ones in a byte?
Thanks a lot, guys!
Concerning the first part of question:
// Function to initialise the lookup table
void initialize()
{
// To initially generate the
// table algorithmically
BitsSetTable256[0] = 0;
for (int i = 0; i < 256; i++)
{
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
}
}
This is a neat kind of recursion. (Please, note I don't mean "recursive function" but recursion in a more mathematical sense.)
The seed is BitsSetTable256[0] = 0;
Then every element is initialized from the (already existing) result for i / 2, to which 1 or 0 is added. Thereby,
1 is added if the last bit of index i is 1
0 is added if the last bit of index i is 0.
To get the value of last bit of i, i & 1 is the usual C/C++ bit mask trick.
Why is the result of BitsSetTable256[i / 2] a value to build upon?
The result of BitsSetTable256[i / 2] is the number of set bits of i, excluding the last one.
Please note that i / 2 and i >> 1 (the value (or bits) shifted right by 1, whereby the least significant bit is dropped) are equivalent expressions (for non-negative numbers in the resp. range – edge cases excluded).
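The recurrence can be sketched as a small table builder. The assumptions here are mine: the name make_bit_count_table, the use of std::array, and starting the loop at 1, since entry 0 is already the seed.

```cpp
#include <array>
#include <cstdint>

// Builds the 256-entry table with the recurrence
//   count(i) = (lowest bit of i) + count(i >> 1)
std::array<std::uint8_t, 256> make_bit_count_table() {
    std::array<std::uint8_t, 256> table{};  // zero-initialized, so table[0] == 0 is the seed
    for (unsigned i = 1; i < 256; ++i) {
        // i >> 1 is smaller than i, so table[i >> 1] has already been filled in.
        table[i] = static_cast<std::uint8_t>((i & 1u) + table[i >> 1]);
    }
    return table;
}
```

For instance, entry 7 (0111b) is 1 + entry 3 = 1 + 2 = 3, and entry 8 (1000b) is 0 + entry 4 = 1.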
Concerning the other part of the question:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[n >> 24]);
n & 0xff masks out the upper bits, isolating the lower 8 bits.
(n >> 8) & 0xff shifts the value of n 8 bits to the right (whereby the 8 least significant bits are dropped) and then again masks out the upper bits, isolating the lower 8 bits.
(n >> 16) & 0xff shifts the value of n 16 bits to the right (whereby the 16 least significant bits are dropped) and then again masks out the upper bits, isolating the lower 8 bits.
n >> 24 shifts the value of n 24 bits to the right (whereby the 24 least significant bits are dropped), which should effectively make the upper 8 bits the lower 8 bits.
Assuming that int and unsigned have 32 bits, as is usual on today's common platforms, this covers all bits of n.
Please note that the right shift of a negative value is implementation-defined (in C, and in C++ before C++20).
(I looked up Bitwise shift operators to be sure.)
So, a right shift of a negative value may fill all upper bits with 1s.
That can break BitsSetTable256[n >> 24]: n >> 24 is then negative, and hence BitsSetTable256[n >> 24] is an out-of-bounds access.
The better solution would've been:
return (BitsSetTable256[n & 0xff] +
BitsSetTable256[(n >> 8) & 0xff] +
BitsSetTable256[(n >> 16) & 0xff] +
BitsSetTable256[(n >> 24) & 0xff]);
BitsSetTable256[0] = 0;
...
BitsSetTable256[i] = (i & 1) +
BitsSetTable256[i / 2];
The above code seeds the look-up table where each index contains the number of ones for the number used as index and works as:
(i & 1) gives 1 for odd numbers, otherwise 0.
An even number will have as many binary 1 as that number divided by 2.
An odd number will have one more binary 1 than that number divided by 2.
Examples:
if i==8 (1000b) then (i & 1) + BitsSetTable256[i / 2] ->
0 + BitsSetTable256[8 / 2] = 0 + index 4 (0100b) = 0 + 1 .
if i==7 (0111b) then 1 + BitsSetTable256[7 / 2] = 1 + BitsSetTable256[3] = 1 + index 3 (0011b) = 1 + 2.
If you want some formal mathematical proof why this is so, then I'm not the right person to ask, I'd poke one of the math sites for that.
As for the shift part, it's just the normal way of splitting up a 32-bit value into 4x8 bits, portably and without caring about endianness (any other method to do that is highly questionable). If we un-sloppify the code, we get this:
BitsSetTable256[(n >> 0) & 0xFFu] +
BitsSetTable256[(n >> 8) & 0xFFu] +
BitsSetTable256[(n >> 16) & 0xFFu] +
BitsSetTable256[(n >> 24) & 0xFFu] ;
Each byte is shifted into the LS byte position, then masked out with a & 0xFFu byte mask.
Using bit shifts on int is, however, a code smell and potentially buggy. To avoid poorly-defined behavior, you need to change the function to this:
#include <stdint.h>
uint32_t countSetBits (uint32_t n);
The code in countSetBits takes an int as an argument; apparently 32 bits are assumed. The implementation there is extracting four single bytes from n by shifting and masking; for these four separated bytes, the lookup is used and the number of bits per byte there are added to yield the result.
The initialization of the lookup table is a bit more tricky and can be seen as a form of dynamic programming. The entries are filled in increasing index of the argument. The first expression masks out the least significant bit and counts it; the second expression halves the argument (which could be also done by shifting). The resulting argument is smaller; it is then correctly assumed that the necessary value for the smaller argument is already available in the lookup table.
For the access to the lookup table, consider the following example:
input value (contains 5 ones):
01010000 00000010 00000100 00010000
input value, shifting is not necessary
masked with 0xff (11111111)
00000000 00000000 00000000 00010000 (contains 1 one)
input value shifted by 8
00000000 01010000 00000010 00000100
and masked with 0xff (11111111)
00000000 00000000 00000000 00000100 (contains 1 one)
input value shifted by 16
00000000 00000000 01010000 00000010
and masked with 0xff (11111111)
00000000 00000000 00000000 00000010 (contains 1 one)
input value shifted by 24,
masking is not necessary
00000000 00000000 00000000 01010000 (contains 2 ones)
The extracted values only use the lowermost 8 bits, which means that the corresponding entries are available in the lookup table. The entries from the lookup table are added. The underlying idea is that the number of ones in the argument can be calculated byte-wise (in fact, any partition into bitstrings would be suitable).
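The worked example above can be reproduced with a tiny helper. This is only a sketch: byte_of, ones_in_byte and count_bytewise are hypothetical names, the argument is unsigned so that the right shift is well defined, and a plain loop stands in for the table lookup.

```cpp
#include <cstdint>

// Extracts byte number k of n (0 = least significant byte).
std::uint32_t byte_of(std::uint32_t n, unsigned k) {
    return (n >> (8 * k)) & 0xFFu;
}

// Counts the set bits of a single byte; stands in for the table lookup.
unsigned ones_in_byte(std::uint32_t b) {
    unsigned count = 0;
    for (; b != 0; b >>= 1) {
        count += b & 1u;
    }
    return count;
}

// Adds up the per-byte counts, mirroring the structure of countSetBits.
unsigned count_bytewise(std::uint32_t n) {
    return ones_in_byte(byte_of(n, 0)) + ones_in_byte(byte_of(n, 1)) +
           ones_in_byte(byte_of(n, 2)) + ones_in_byte(byte_of(n, 3));
}
```

The example input 01010000 00000010 00000100 00010000 is 0x50020410; its bytes come out as 0x10, 0x04, 0x02 and 0x50, with 1 + 1 + 1 + 2 = 5 ones in total.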

Trouble understanding piece of code. Bitwise operations in c

I have the following segment of code and am having trouble deciphering what it does.
/* assume 0 <= n <=3 and 0 <= m <=3 */
int n8= n <<3;
int m8 = m <<3;
int n_mask = 0xff << n8;
int m_mask = 0xff << m8; // left bitshifts 255 by the value of m8
int n_byte = ((x & n_mask) >> n8) & 0xff;
int m_byte = ((x & m_mask) >> m8) & 0xff;
int bytes_mask = n_mask | m_mask ;
int leftover = x & ~bytes_mask;
return ( leftover | (n_byte <<m8)| (m_byte << n8) );
It swaps the nth and mth bytes.
The start has two parallel computations, one sequence with n and one sequence with m, that select the nth and mth byte like this:
Step 1: 0xff << n8
0x000000ff << 0 = 0x000000ff
.. 8 = 0x0000ff00
.. 16 = 0x00ff0000
.. 24 = 0xff000000
Step 2: x & n_mask
x = 0xDDCCBBAA
x & 0x000000ff = 0x000000AA
x & 0x0000ff00 = 0x0000BB00
x & 0x00ff0000 = 0x00CC0000
x & 0xff000000 = 0xDD000000
Step 3: ((x & n_mask) >> n8) & 0xff (note: & 0xff is required because the right shift is likely to be an arithmetic right shift, it would not be required if the code worked with unsigned integers)
n = 0: 0x000000AA
1: 0x000000BB
2: 0x000000CC
3: 0x000000DD
So it extracts the nth byte and puts it at the bottom of the integer.
The same thing is done for m.
leftover is the other (2 or 3) bytes, the ones not extracted by the previous process. There may be 3 bytes left over, because n and m can be the same.
Finally the last step is to put it all back together, but with the byte extracted from the nth position shifted to the mth position, and the mth byte shifted to the nth position, so they switch places.
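For comparison, here is a sketch of the same byte swap written with unsigned types, which avoids the arithmetic-shift concern and so needs no extra & 0xff. The function name swap_bytes and the use of std::uint32_t are my assumptions; the original code's surrounding function is not shown.

```cpp
#include <cstdint>

// Swaps byte n and byte m of x (0 = least significant byte).
// Assumes 0 <= n <= 3 and 0 <= m <= 3, as in the original snippet.
std::uint32_t swap_bytes(std::uint32_t x, unsigned n, unsigned m) {
    const unsigned n8 = n << 3;                          // bit offset of byte n
    const unsigned m8 = m << 3;                          // bit offset of byte m
    const std::uint32_t n_mask = std::uint32_t{0xFFu} << n8;
    const std::uint32_t m_mask = std::uint32_t{0xFFu} << m8;
    const std::uint32_t n_byte = (x & n_mask) >> n8;     // logical shift, already 0..255
    const std::uint32_t m_byte = (x & m_mask) >> m8;
    const std::uint32_t leftover = x & ~(n_mask | m_mask);
    return leftover | (n_byte << m8) | (m_byte << n8);
}
```

With x = 0xDDCCBBAA, swapping bytes 0 and 3 gives 0xAACCBBDD, and swapping a byte with itself leaves x unchanged.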

Convert every 5 bits into integer values in C++

Firstly, if anyone has a better title for me, let me know.
Here is an example of the process I am trying to automate with C++
I have an array of values that appear in this format:
9C07 9385 9BC7 00 9BC3 9BC7 9385
I need to convert them to binary and then convert every 5 bits to decimal like so with the last bit being a flag:
I'll do this with only the first word here.
9C07
10011 | 10000 | 00011 | 1
19 | 16 | 3
These are actually x,y,z coordinates, and the final bit determines the order they are in: a '0' would make it x=19 y=16 z=3, and a '1' makes it x=16 y=3 z=19.
I already have a buffer filled with these hex values, but I have no idea where to go from here.
I assume these are integer literals, not strings?
The way to do this is with bitwise right shift (>>) and bitwise AND (&)
#include <cstdint>
struct Coordinate {
std::uint8_t x = 0; // initialized here so the constexpr constructor is valid in C++14/17
std::uint8_t y = 0;
std::uint8_t z = 0;
constexpr Coordinate(std::uint16_t n) noexcept
{
if (n & 1) { // flag
x = (n >> 6) & 0x1F; // 1 1111
y = (n >> 1) & 0x1F;
z = n >> 11;
} else {
x = n >> 11;
y = (n >> 6) & 0x1F;
z = (n >> 1) & 0x1F;
}
}
};
The following code would extract the three coordinates and the flag from the 16 least significant bits of value (ie. its least significant word).
int flag = value & 1; // keep only the least significant bit
value >>= 1; // shift right by one bit
int third_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits
int second_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits
int first_integer = value & 0x1f; // keep only the five least significant bits
value >>= 5; // shift right by five bits (only useful if there are other words in "value")
What you need is most likely some loop doing this on each word of your array.
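Such a loop might look like the following sketch. The Point struct and the function names decode_word and decode_all are hypothetical, and the buffer is assumed to be a std::vector of 16-bit words; the flag handling follows the ordering described in the question.

```cpp
#include <cstdint>
#include <vector>

struct Point {  // hypothetical result type
    int x;
    int y;
    int z;
};

// Splits one 16-bit word into three 5-bit fields plus a 1-bit flag.
Point decode_word(std::uint16_t word) {
    const int first = (word >> 11) & 0x1F;   // top five bits
    const int second = (word >> 6) & 0x1F;   // middle five bits
    const int third = (word >> 1) & 0x1F;    // low five bits above the flag
    const bool flag = (word & 1) != 0;
    // flag 0: x, y, z = first, second, third
    // flag 1: x, y, z = second, third, first
    return flag ? Point{second, third, first} : Point{first, second, third};
}

// Decodes every word of the buffer in order.
std::vector<Point> decode_all(const std::vector<std::uint16_t>& words) {
    std::vector<Point> out;
    out.reserve(words.size());
    for (std::uint16_t w : words) {
        out.push_back(decode_word(w));
    }
    return out;
}
```

For the word 0x9C07 from the question, the fields are 19, 16 and 3 with the flag set, so the decoded point is x=16 y=3 z=19.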

8-digit BCD check

I have an 8-digit BCD number and need to check whether it is a valid BCD number. How can I do this programmatically (C/C++)?
Ex: 0x12345678 is valid, but 0x00f00abc isn't.
Thanks in advance!
You need to check each 4-bit quantity to make sure it's less than 10. For efficiency you want to work on as many bits as you can at a single time.
Here I break the digits apart to leave a zero between each one, then add 6 to each and check for overflow.
uint32_t highs = (value & 0xf0f0f0f0) >> 4;
uint32_t lows = value & 0x0f0f0f0f;
bool invalid = (((highs + 0x06060606) | (lows + 0x06060606)) & 0xf0f0f0f0) != 0;
Edit: actually we can do slightly better. It doesn't take 4 bits to detect overflow, only 1. If we divide all the digits by 2, it frees a bit and we can check all the digits at once.
uint32_t halfdigits = (value >> 1) & 0x77777777;
bool invalid = ((halfdigits + 0x33333333) & 0x88888888) != 0;
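Wrapped into functions, the two variants could look like this sketch. The function names are my own, and the return value is flipped to "valid" for readability.

```cpp
#include <cstdint>

// Variant 1: spread the digits into byte lanes, add 6 to each, and look
// for overflow into the upper half of any byte.
bool is_valid_bcd_bytes(std::uint32_t value) {
    const std::uint32_t highs = (value & 0xf0f0f0f0u) >> 4;
    const std::uint32_t lows = value & 0x0f0f0f0fu;
    return (((highs + 0x06060606u) | (lows + 0x06060606u)) & 0xf0f0f0f0u) == 0;
}

// Variant 2: halve each digit (giving 0..7), add 3, and test the top
// bit of each nibble; it is set exactly when the digit was 10..15.
bool is_valid_bcd(std::uint32_t value) {
    const std::uint32_t halfdigits = (value >> 1) & 0x77777777u;
    return ((halfdigits + 0x33333333u) & 0x88888888u) == 0;
}
```

Both report 0x12345678 as valid and 0x00f00abc as invalid, matching the examples in the question.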
The obvious way to do this is:
/* returns 1 if x is valid BCD */
int
isvalidbcd (uint32_t x)
{
for (; x; x = x>>4)
{
if ((x & 0xf) >= 0xa)
return 0;
}
return 1;
}
This link tells you all about BCD, and recommends something like this as a more optimised solution (reworked to check all the digits, and hence using a 64-bit data type; untested):
/* returns 1 if x is valid BCD */
int
isvalidbcd (uint32_t x)
{
/* adding 6 to every digit carries out of any digit >= 10; the XOR with x exposes those carries */
return !((((uint64_t)x + 0x66666666ULL) ^ (uint64_t)x) & 0x111111110ULL);
}
For a digit to be invalid, it needs to be 10-15. That in turn means 8 + 4 or 8+2 - the low bit doesn't matter at all.
So:
long mask8 = value & 0x88888888;
long mask4 = value & 0x44444444;
long mask2 = value & 0x22222222;
return ((mask8 >> 2) & ((mask4 >> 1) | mask2)) == 0;
Slightly less obvious:
long mask8 = (value>>2);
long mask42 = (value | (value >> 1));
return (mask8 & mask42 & 0x22222222) == 0;
By shifting before masking, we don't need 3 different masks.
Inspired by @Mark Ransom
bool invalid = (0x88888888 & (((value & 0xEEEEEEEE) >> 1) + (0x66666666 >> 1))) != 0;
// or
bool valid = !((((value & 0xEEEEEEEEu) >> 1) + 0x33333333) & 0x88888888);
Mask off each BCD digit's 1's place, shift right, then add 3 (that is, 6 >> 1) and check for BCD digit overflow.
How this works:
By adding 6 to each digit, we look for an overflow * of the 4-bit sum.
  abcd
+  110
------
 *efgd
But the bit value of d does not contribute to the overflow, so first mask off that bit and shift right. Now the overflow bit is in the 8's place. This is all done in parallel, and we mask these carry bits with 0x88888888 and test whether any are set.
 0abc
+  11
-----
 *efg

Number of bits set in a number

The following is the magical formula which gives the number of bits set in a number (Hamming weight).
/*Code to Calculate count of set bits in a number*/
int c;
int v = 7;
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
printf(" Number of Bits is %d",c);
/*-----------------------------------*/
from:
http://graphics.stanford.edu/~seander/bithacks.html
Can anyone please explain the rationale behind this to me?
It's really quite clever code, and is obviously a lot more difficult to understand than a simple naive loop.
For the first line, let's just take a four-bit quantity, and call it abcd. The code basically does this:
abcd - ((abcd >> 1) & 0101) = abcd - (0abc & 0101) = abcd - 0a0c
So, in each group of two bits, it subtracts the value of the high bit. What does that net us?
11 - 1 -> 10 (two bits set)
10 - 1 -> 01 (one bit set)
01 - 0 -> 01 (one bit set)
00 - 0 -> 00 (zero bits set)
So, that first line sets each consecutive group of two bits to the number of bits contained in the original value -- it counts the bits set in groups of two. Call the resulting four-bit quantity ABCD.
The next line:
(ABCD & 0011) + ((ABCD>>2) & 0011) = 00CD + (AB & 0011) = 00CD + 00AB
So, it takes the groups of two bits and adds pairs together. Now, each four-bit group contains the number of bits set in the corresponding four bits of the input.
In the next line, v + (v >> 4) & 0xF0F0F0F (which is parsed as (v + (v >> 4)) & 0xf0f0f0f) does the same, adding pairs of four-bit groups together so that each eight-bit group (byte) contains the bit-set count of the corresponding input byte. We now have a number like 0x0e0f0g0h.
Note that multiplying a byte in any position by 0x01010101 will copy that byte up to the most-significant byte (as well as leaving some copies in lower bytes). For example, 0x00000g00 * 0x01010101 = 0x0g0g0g00. So, by multiplying 0x0e0f0g0h, we will leave e+f+g+h in the topmost byte; the >>24 at the end extracts that byte and leaves you with the answer.
One-liner solution in Python for counting the number of ones in the binary representation of a number:
[i for i in str(bin(n)) if i=="1"].count("1")