How to calculate the set bit positions in a number? - bit-manipulation

If n = 100011 in binary, then I want to retrieve the positions of set bits which in this case are 1,5,6 when measured from left to right.
How to calculate such positions without literally checking for bit is zero or not by going to every bit position?

In the most common convention, a binary number is written in the same order as a number in other common positional representations (decimal etc.): with the least significant digit in the rightmost position. It also makes more sense to label that digit as "digit zero", so that the label of every digit corresponds with the exponent in the associated weight (e.g. bit 0 has weight 2^0 = 1 and so forth). This doesn't really matter, it's easy enough to re-number the digits, but it's usually easier to follow the conventions.
Since you asked
How to calculate such positions without literally checking for bit is zero or not by going to every bit position?
I will address that portion of the question. Checking the bits one by one is not completely disastrous, however, even for BigInts. The number of results could be as high as the number of bits anyway. For numbers known to be sparse, there is still not much that can be done: every bit has to be checked somehow, because if any bit is ignored completely, that bit might have been set and we'd miss it. But in the context of a machine word, there are tricks, for example based on find-first-set.
Using the find-first-set function (or count trailing zeroes), the index of the set bit with the lowest index can be found in one step (if you accept this function as being one step, which is a reasonable assumption on most hardware, and in theory you can just define it to be one step), and then that bit can be removed so the next find-first-set will find the index of the next bit. For example:
while bitmask != 0:
    yield return find-first-set(bitmask)
    bitmask &= bitmask - 1 // remove lowest set bit
This is easy to adapt to BigInts, just do this on every limb of the number and add the appropriate offset.
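The loop above could be sketched in C roughly as follows. This is only an illustration: the function name set_bit_positions is made up here, and __builtin_ctz is a GCC/Clang builtin standing in for find-first-set (other compilers have their own equivalents). Indices are 0-based with bit 0 as the least significant bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores the 0-based indices (bit 0 = least significant)
   of the set bits of bitmask in out[], lowest first, and returns how many
   indices were written. out must have room for 32 entries. */
size_t set_bit_positions(uint32_t bitmask, int out[32])
{
    size_t n = 0;
    while (bitmask != 0) {
        /* __builtin_ctz (GCC/Clang) counts trailing zeros; it is undefined
           for 0, which the loop condition rules out. */
        out[n++] = __builtin_ctz(bitmask);
        bitmask &= bitmask - 1;   /* remove lowest set bit */
    }
    return n;
}
```

To get the numbering used in the question (1-based, counted from the left of a w-bit number), each index i would be converted to w - i.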

To do that you use masks.
Each position from right to left is a power of two.
For example 0101 is 1*2^0 + 0*2^1 + 1*2^2 + 0*2^3 = 1+0+4+0 = 5
Then to check if these two bits are on in a byteToTest variable you AND with 5: (byteToTest & 5) == 5 (the parentheses matter in C-like languages, where == binds tighter than &)
Given that 1 & 0 = 0 and 1 & 1 = 1
If byteToTest is 1111 then 1111 & 0101 will give 0101
If byteToTest is 1010 then 1010 & 0101 will give 0000
Following this reasoning for the particular case of 100011
To retrieve 1, 5, and 6 from left to right (the three bits set to 1)
The mask is: 1+2+32 = 35
With this information you should be able to define individual masks for each bit, test one by one, and be able to answer in which position you find bits that are on and in which bits that are off.
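As a rough sketch of that bit-by-bit mask test in C (the function name and the explicit width parameter are assumptions made for this example), positions are reported 1-based from the left, as in the question:

```c
#include <stddef.h>

/* Hypothetical helper: tests each of the `width` bits of `value` with its own
   single-bit mask and records the 1-based positions, counted from the left,
   of the bits that are on. Returns the number of positions written to out[]. */
size_t set_positions_left_to_right(unsigned value, int width, int out[])
{
    size_t n = 0;
    for (int pos = 1; pos <= width; pos++) {
        unsigned mask = 1u << (width - pos);   /* leftmost bit has the largest weight */
        if ((value & mask) != 0)
            out[n++] = pos;
    }
    return n;
}
```

Called with value 35 (100011) and width 6, this tests the masks 32, 16, 8, 4, 2, 1 in turn.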


How to negate only least significant bits?

Starting from the position of the highest order 1 bit, how can I negate that bit and all lower order bits?
Example (C#)
int inputNumber = 0b_10001;
inputNumber = ~inputNumber; // bitwise complement
int expectedNumber = 0b_000000000000000000000000000_01110;
uint actualNumber = 0b_111111111111111111111111111_01110; // almost, but not quite
0b is just a marker to start writing the number in binary. The _ is just a separator to make it visually easier to follow (but still valid C# code). Example: 0b_1010_1111 is actually 10101111 in binary.
I feel I'm close to the solution - I need to make a mask to get rid of the unwanted bits, but I'm not sure how.
How can I get from 10001 to 01110, basically negating each bit, but without having the leading 1s?
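One possible way to build such a mask, sketched in C rather than C# and assuming a 32-bit unsigned value: smear the highest set bit into every lower position, then AND the complement with that mask. The function name is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: complements the highest set bit of x and every bit
   below it, leaving the leading zeros as zeros. */
uint32_t negate_below_highest_set_bit(uint32_t x)
{
    uint32_t mask = x;
    /* Propagate the highest set bit to the right, so that mask ends up with
       ones from that bit down to bit 0. */
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    return ~x & mask;   /* complement, then drop the unwanted leading ones */
}
```

For the input 10001 the mask becomes 11111, so the complement is cut down to 01110.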

next number in which last set bit is same

Take the number 2 (0010). I want another number whose last set bit is in the same position as in 2, which is the first position (k=1).
According to the rule, we first use the formula x & -x to extract the last set bit of a number, and then add it to the original number.
But in this case the last set bit is the whole number, so applying the formula gives 4 (0100), from adding 0010 and 0010,
but we should get 6 (0110).
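If the goal is the next larger number whose lowest set bit stays in the same position, one reading is that the value to add is twice the extracted bit rather than the bit itself: adding x & -x carries into that position and clears it, while adding the next higher power of two leaves it alone. A small C sketch under that assumption (the function name is made up, and x is assumed to be nonzero):

```c
#include <stdint.h>

/* Hypothetical helper: returns the next larger value whose lowest set bit is
   in the same position as in x (x is assumed to be nonzero). */
uint32_t next_with_same_lowest_set_bit(uint32_t x)
{
    uint32_t lowest = x & (0u - x);   /* x & -x, written for unsigned types */
    return x + (lowest << 1);         /* skip over the lowest set bit */
}
```

Starting from 2 this produces 6, then 10, 14 and so on, all of which end in ...10.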

Create a sequence which is ordered by bits set

I'm looking for a reversible function unsigned f(unsigned) for which the number of bits set in f(i) increases with i, or at least does not decrease. Obviously, f(0) has to be 0 then, and f(~0) must come last. In between there's more flexibility. After f(0), the next 32* values must be 1U<<0 to 1U<<31, but I don't care a lot about the order (they all have 1 bit set).
I'd like an algorithm which doesn't need to calculate f(0)...f(i-1) in order to calculate f(i), and a complete table is also unworkable.
This is similar to Gray codes, but I can't see a way to reuse that algorithm. I'm trying to use this to label a large data set, and prioritize the order in which I search them. The idea is that I have a key C, and I'll check labels C ^ f(i). Low values of i should give me labels similar to C, i.e. differing in only a few bits.
[*] Bonus points for not assuming that unsigned has 32 bits.
A valid initial sequence:
0, 1, 2, 4, 16, 8 ... // 16 and 8 both have one bit set, so they compare equal
An invalid initial sequence:
0, 1, 2, 3, 4 ... // 3 has two bits set, so it cannot precede 4 or 2147483648.
Ok, seems like I have a reasonable answer. First let's define binom(n,k) as the number of ways in which we can set k out of n bits. That's the classic Pascal triangle:
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
Easily calculated and cached. Note that the sum of each line is 1<<lineNumber.
The next thing we'll need is the partial_sum of that triangle:
1 2
1 3 4
1 4 7 8
1 5 11 15 16
1 6 16 26 31 32
1 7 22 42 57 63 64
1 8 29 64 99 120 127 128
1 9 37 93 163 219 247 255 256
Again, this table can be created by summing two values from the previous line, except that the new entry on each line is now 1<<line instead of 1.
Let's use these tables above to construct f(x) for an 8-bit number (it trivially generalizes to any number of bits). f(0) still has to be 0. Looking up the 8th row in the first triangle, we see that the next 8 entries are f(1) to f(8), all with one bit set. The next 28 entries (7+6+5+4+3+2+1) all have 2 bits set, so that's f(9) to f(36). The next 56 entries, f(37) to f(92), have 3 bits set, and there are 70 entries with 4 bits set. From symmetry we can see that they're centered around f(128), in particular they're f(93) to f(162). And obviously, the only number with 8 bits set sorts last, as f(255). Note that the start of each group (1, 9, 37, 93, 163, ...) is exactly the last row of the partial-sum table.
So, with these tables we can quickly determine how many bits must be set in f(i). Just do a binary search in the last row of your table. But that doesn't answer exactly which bits are set. For that we need the previous rows.
The reason that each value in the table can be created from the previous line is simple. binom(n,k) == binom(n-1,k) + binom(n-1,k-1). There are two sorts of n-bit numbers with k bits set: those that start with 0... and those that start with 1.... In the first case, the next n-1 bits must contain those k bits; in the second case, the next n-1 bits must contain only k-1 bits. Special cases are of course 0 out of n and n out of n.
This same structure can be used to quickly tell us what f(15) must be. We had already established that it must contain 2 bits set, as it falls in the range f(9) to f(36). In particular, it's number 6 with 2 bits set (starting as usual with 0). It's useful to define this as an offset in a range, as we'll try to shrink the length of this range from 28 down to 1.
We now subdivide that range into 21 values which start with a zero and 7 which start with a one. Since 6 < 21, we know that the first digit is a zero. Of the remaining 7 bits, still 2 need to be set, so we move up a line in the triangle and see that 15 values start with two zeroes, and 6 start with 01. Since 6 < 15, f(15) starts with 00. Going further up, 6 < 10, so it starts with 000. But 6 == 6, so it doesn't start with 0000 but 0001. At this point we change the start of the range, so the new offset becomes 0 (6-6).
We now need to focus only on the numbers that start with 0001 and have one extra bit, which are f(15)...f(18). It should be obvious by now that the range is f(15)=00010001, f(16)=00010010, f(17)=00010100, f(18)=00011000.
So, to calculate each bit, we move one row up in the triangle, compare our "remainder", add a zero or one based on the comparison, and possibly go left one column. That means the computational complexity of f(x) is O(bits), or O(log N), and the storage needed is O(bits*bits).
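The procedure described above could be sketched in C along these lines. This is an illustration under stated assumptions, not a tuned implementation: the names init_binom and nth_by_popcount are made up, indices are 0-based (so index 0 maps to 0), values with the same number of set bits are ordered by increasing numeric value, and a linear scan replaces the binary search over the row.

```c
#include <stdint.h>

#define MAX_BITS 32

/* binom_table[n][k] = number of ways to set k out of n bits (Pascal's triangle).
   Entries with k > n stay 0 because the array is zero-initialized. */
static uint64_t binom_table[MAX_BITS + 1][MAX_BITS + 1];

/* Must be called once before nth_by_popcount. */
void init_binom(void)
{
    for (int n = 0; n <= MAX_BITS; n++) {
        binom_table[n][0] = 1;
        for (int k = 1; k <= n; k++)
            binom_table[n][k] = binom_table[n - 1][k - 1] + binom_table[n - 1][k];
    }
}

/* Returns the i-th value (0-based) of the sequence of `bits`-bit numbers
   ordered by number of set bits. Assumes 1 <= bits <= 32 and i < 2^bits. */
uint32_t nth_by_popcount(uint32_t i, int bits)
{
    uint64_t rem = i;
    int k = 0;

    /* Step 1: find how many bits are set, leaving rem as the offset
       inside the group of values with exactly k bits set. */
    while (k < bits && rem >= binom_table[bits][k]) {
        rem -= binom_table[bits][k];
        k++;
    }

    /* Step 2: decide each bit from the top, moving one row up the triangle. */
    uint32_t result = 0;
    for (int n = bits; n > 0 && k > 0; n--) {
        uint64_t start_with_zero = binom_table[n - 1][k];
        if (rem >= start_with_zero) {       /* offset lies past the "0..." part */
            result |= (uint32_t)1 << (n - 1);
            rem -= start_with_zero;
            k--;                            /* one fewer bit left to place */
        }
    }
    return result;
}
```

The table here covers up to 32 bits; a wider unsigned type would only need a larger MAX_BITS and wider result type.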
For each given number k we know that there are binom(n, k) n-bit integers that have exactly k bits of value one. We can now generate a lookup table of n + 1 integers that stores, for each k, how many numbers have fewer one bits. This lookup table can then be used to find the number o of one bits of f(i).
Once we know this number, we subtract the lookup table value for this number of bits from i, which leaves us with the permutation index p for numbers with the given number of one bits. Although I have not done research in this area, I am quite sure that there exists a method for finding the pth permutation of a std::vector<bool> which is initialized with zeros and o ones in the lowest bits.
The reverse function
Again the lookup table comes in handy. We can directly calculate the number of preceding numbers with less one bits by counting the one bits in the input integer and reading in the lookup table. Then you "only" need to determine the permutation index and add it to the looked up value and you are done.
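A possible sketch of that reverse direction in C, assuming that values with the same number of one bits are ranked by increasing numeric value (the answer leaves the permutation order open). The names binom and popcount_rank are made up for this example, and the binomials are computed directly here instead of being read from a cached table.

```c
#include <stdint.h>

/* Number of ways to choose k out of n; exact for n <= 32 in 64-bit arithmetic. */
static uint64_t binom(int n, int k)
{
    if (k < 0 || k > n)
        return 0;
    uint64_t r = 1;
    for (int i = 0; i < k; i++)
        r = r * (uint64_t)(n - i) / (uint64_t)(i + 1);
    return r;
}

/* Hypothetical inverse: 0-based position of x among all `bits`-bit values
   ordered by number of one bits, then by numeric value. */
uint64_t popcount_rank(uint32_t x, int bits)
{
    int k = 0;
    for (uint32_t t = x; t != 0; t &= t - 1)
        k++;                                 /* count the one bits of x */

    uint64_t index = 0;
    for (int j = 0; j < k; j++)
        index += binom(bits, j);             /* values with fewer one bits come first */

    /* Permutation index: every set bit skips the values that have a zero there. */
    for (int n = bits; n > 0 && k > 0; n--) {
        if ((x >> (n - 1)) & 1u) {
            index += binom(n - 1, k);
            k--;
        }
    }
    return index;
}
```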
Of course this is only a rough outline and some parts (especially involving the permutations) might take longer than it sounds.
You stated yourself
I'm trying to use this to label a large data set, and prioritize the order in which I search them.
Which sounds to me as if you would be going from the low hamming distance to the high hamming distance. In this case it would be enough to have an incremental version which generates the next number from the previous:
unsigned next(unsigned previous)
{
    if (std::next_permutation(previous)) // pseudocode: next value with the same number of one bits
        return previous;
    return (1 << (1 + countOneBits(previous))) - 1;
}
Of course std::next_permutation does not work this way, but I think it is clear how I mean to use it.
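Since std::next_permutation cannot be applied to an integer like that, one way to realize the same idea in C is the well-known bit trick often called Gosper's hack, which produces the next larger value with the same number of one bits. The sketch below swaps that trick in; the function names and the explicit bits parameter (the width of the labels, assumed to be between 1 and 32) are assumptions of this example.

```c
#include <stdint.h>

static int count_one_bits(uint32_t x)
{
    int n = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical incremental version: successor of `previous` among `bits`-bit
   values ordered by number of one bits. Wraps around to 0 after the value
   with all bits set. */
uint32_t next_by_popcount(uint32_t previous, int bits)
{
    uint64_t full = ((uint64_t)1 << bits) - 1;
    if (previous == full)
        return 0;

    int k = count_one_bits(previous);
    /* Largest value with k one bits: all of them at the top of the word. */
    uint64_t largest = (((uint64_t)1 << k) - 1) << (bits - k);
    if (previous == largest)
        return (uint32_t)(((uint64_t)1 << (k + 1)) - 1);   /* first value with k+1 one bits */

    /* Gosper's hack: next larger value with the same number of one bits. */
    uint32_t c = previous & (0u - previous);
    uint32_t r = previous + c;
    return (((r ^ previous) >> 2) / c) | r;
}
```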

Hamming SEC/DED extra parity bit

I'm having some trouble with the SEC/DED error correction code. It seems I've found some cases in which the decoder thinks a double bit flip occurred when only one really occurred. I suppose I did something wrong, but I was not able to understand what.
Let me show you an example.
Suppose I want to encode the 4 bits 1011 using a (7,4) code plus an extra bit needed to perform the two-error-detection. The coded word should be 00110011, where the most significant bit is the extra parity bit, the following two are p0 and p1 and so on.
Now, let's suppose that during a transmission the least significant bit is flipped; thus the received word will be 00110010. The receiver will extract from this code the four received data bits 1010 and will construct a new code, which will be 01011010. Finally, the receiver will perform a bitwise xor of the parity bits of the two codes, obtaining 0111. The last three bits say that bit 7 has been flipped (which is right), but the first bit is 0 and, as far as I know, the decoder should treat this situation as if more than one bit flip has occurred.
What did I do wrong?
I think I've solved the problem.
In the example above I calculate the syndrome and then I compute a new overall parity bit of the resultant codeword. Instead, I should check the overall parity of the received word and set the error_happened boolean to that value; then calculate the syndrome.
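The corrected order of checks could be sketched in C as follows. This is an illustration, assuming the bit layout from the example above (most significant bit is the overall parity bit, followed by Hamming positions 1 to 7, with parity bits at positions 1, 2 and 4) and even parity throughout; the function and enum names are made up.

```c
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_DOUBLE_ERROR };

/* Bit at Hamming position pos (1..7), or the overall parity bit for pos 0.
   Position 0 is the most significant bit of the byte. */
static int bit_at(uint8_t word, int pos)
{
    return (word >> (7 - pos)) & 1;
}

/* Hypothetical decoder: checks the overall parity of the RECEIVED word first,
   then uses the syndrome. *corrected receives the repaired word when possible. */
enum secded_status secded_decode(uint8_t received, uint8_t *corrected)
{
    int error_happened = 0;   /* overall parity of all 8 received bits */
    int syndrome = 0;         /* XOR of the positions holding a 1 */

    for (int pos = 0; pos <= 7; pos++) {
        if (bit_at(received, pos)) {
            error_happened ^= 1;
            syndrome ^= pos;  /* position 0 does not affect the syndrome */
        }
    }

    *corrected = received;
    if (!error_happened)      /* even number of flips: none, or two */
        return syndrome == 0 ? SECDED_OK : SECDED_DOUBLE_ERROR;

    /* Odd number of flips, assumed to be exactly one. A syndrome of 0 means
       the overall parity bit itself was hit. */
    *corrected = (uint8_t)(received ^ (1u << (7 - syndrome)));
    return SECDED_CORRECTED;
}
```

With the received word 00110010 from the example, the overall parity is odd and the syndrome is 7, so the flip is treated as a single error at position 7.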

Find a repeating symmetric bit pattern in a small stream of 128 bits

How can I quickly scan groups of 128 bits for exactly repeating binary patterns, such as 010101... or 0011001100...?
I have a number of 128-bit blocks, and wish to see if they match the patterns where the number of 1s is equal to the number of 0s, e.g. 010101... or 00110011... or 0000111100001111... but NOT 001001001...
The problem is that patterns may not start on their boundary, so the pattern 00110011... may begin as 0110011..., and will end 1 bit shifted also (note the 128 bits are not circular, so the start doesn't join to the end).
The 010101... case is easy, it is simply 0xAAAA... or 0x5555.... However, as the patterns get longer, the number of shifted variants grows. Currently I use repeated shifting of values, as outlined in the question "Fastest way to scan for bit pattern in a stream of bits", but something quicker would be nice, as I'm spending 70% of all CPU in this routine. Other posters have solutions for general cases, but I am hoping the symmetric nature of my pattern might lead to something more optimal.
If it helps, I am only interested in patterns up to 63 bits long, and most interested in the power-of-2 patterns (0101... 00110011... 0000111100001111... etc.). While patterns such as 5 ones/5 zeros are present, these non-power-of-2 sequences are less than 0.1%, so they can be ignored if it helps the common cases go quicker.
Other constraints for a perfect solution would be small number of assembler instructions, no wildly random memory access (ie, large rainbow tables not ideal).
Edit. More precise pattern details.
I am mostly interested in the patterns of 0011 and 0000,1111 and 0000,0000,1111,1111 and 16 zeros/ones and 32 zeros/ones (commas for readability only), where each pattern repeats continuously within the 128 bits. Patterns whose runs are not 2, 4, 8, 16 or 32 bits long are not as interesting and can be ignored (e.g. 000111...).
The complexity for scanning is that the pattern may start at any position, not just on the 01 or 10 transition. So for example, all of the following would match the 4 bit repeating pattern of 00001111... (commas every 4th bit for readability; an ellipsis means the pattern repeats identically)
0000,1111... or 0001,1110... or 0011,1100... or 0111,1000... or 1111,0000... or 1110,0001... or 1100,0011... or 1000,0111...
Within the 128 bits, the same pattern needs to repeat; two different patterns being present is not of interest. E.g. this is NOT a valid pattern: 0000,1111,0011,0011..., as we have changed from 4 bits repeating to 2 bits repeating.
I have already verified the number of 1s is 64, which is true for all power-of-2 patterns, and now need to identify how many bits make up the repeating pattern (2, 4, 8, 16, 32) and how much the pattern is shifted. E.g. pattern 0000,1111 is a 4 bit pattern, shifted 0, while 0111,1000... is a 4 bit pattern shifted 3.
Let's start with the case where the patterns do start on their boundary. You can check the first bit and use it to determine your state. Then start looping through your block, check the first bit, increment a count, left shift and repeat until you find that you've gotten the opposite bit. You can now use this initial length as the bitset length. Reset the count to 1, then count the next set of opposite bits. When you switch, check the length against the initial length and error out if they're not equal. Here's a quick function - it seems to work as expected for chars, and it shouldn't be too hard to expand it to deal with blocks of 32 bytes.
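#include <stdio.h>

/* Returns the run length if the 8-bit block consists of equal-length
   alternating runs that start on a boundary, or -1 otherwise. */
int check_block(unsigned char myblock)   /* e.g. check_block(0x33) */
{
    const unsigned char mask = 0x80;
    int setlen = 0, count = 0;
    int ones = (myblock & mask) != 0;     /* value of the first bit */

    for (int i = 0; i < 8; i++) {
        int bit = (myblock & mask) != 0;  /* test the current top bit */
        myblock = (unsigned char)(myblock << 1);
        if (bit == ones) {
            count++;                      /* still inside the current run */
        } else {
            if (setlen == 0) setlen = count;   /* the first run fixes the length */
            if (count != setlen) {
                printf("Bad block\n");
                return -1;
            }
            count = 1;                    /* the flipped bit starts a new run */
            ones = !ones;
        }
    }
    if (setlen != 0 && count != setlen) { /* the final run has to match as well */
        printf("Bad block\n");
        return -1;
    }
    printf("Good block with %d repeating bits\n", setlen);
    return setlen;
}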
Now to deal with blocks where there's an offset, I'd suggest counting the number of bits until the first 'flip'. Store this number, then run the above routine until you hit the last segment which should have length unequal to the rest of the sets. Add the initial bits to the last segment's length, and then you should be able to compare it with the size of the rest of the sets correctly.
This code is pretty small, and bit shifting through a buffer shouldn't require too much work on the CPU's part. I'd be interested to see how this solution ends up performing against your current one.
The generic solution for this kind of problem is to create a good hashing function for the patterns and store each pattern in a hash map. Once you have the hash map created for the patterns, try to look up the input stream in the table. I don't have code yet, but let me know if you get stuck in the code. Please post it and I can work on it.
I've thought about making a state machine, so every next byte (out of 16) would advance its state and after some 16 state transitions you'd have the pattern identified. But that doesn't look very promising. Data structures and logic look more complex.
Instead, why not precompute all those 126 patterns (from 01 to 32 zeroes + 32 ones), sort them and perform binary search? That would give you at most 7 iterations of binary search. And you don't need to store all 16 bytes of every pattern, as its halves are identical. That gives you 126*16/2=1008 bytes for the array of patterns. You also need something like 2 bytes per pattern to store the length of zero (one) runs and the shift relative to whatever pattern you consider unshifted. That's a total of 126*(16/2+2)=1260 bytes of data (should be gentle on the data cache) and a very simple and tiny binary search algorithm. Basically, it's just an improvement over the answer that you mentioned in the question.
You might want to try switching to linear search after 4-5 iterations of binary search. That may give a small boost to the overall algorithm.
Ultimately, the winner is determined by testing/profiling. And that's what you should do, get a few implementations and compare them on the real data in the real system.
The restriction that the pattern repeats itself across the whole 128-bit stream limits the number of combinations, and it also gives the sequence properties that make it easy to check:
One needs to iteratively check whether the high and low parts are the same; if they are opposites, check whether that particular length contains consecutive ones.
Example, an 8-bit repeat at offset 3:
00011111 11100000 00011111 11100000 ==> high and low 16 bits are the same
00011111 11100000 ==> high and low parts are inverted.
Neither the same nor inverted means rejection of the pattern.
At that point one needs to check if there's a sequence of ones -- add '1' to the left side and check if it's a power of two: n==(n & -n) is the textbook check for that.
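One way this halving idea might look in C, as a sketch only: the 128-bit block is assumed to arrive as two 64-bit halves, the function name is made up, and the final "single run of ones" test is done here by inverting the half if its top bit is set and then checking that the value plus one is a power of two, which is one interpretation of the check described above. Only run lengths that are powers of two up to 32 are recognized, and the shift is not computed.

```c
#include <stdint.h>

/* Hypothetical detector: returns the run length (1, 2, 4, 8, 16 or 32) if the
   128-bit block hi:lo is a repetition of r zeros followed by r ones at any
   shift, and 0 otherwise. */
int repeating_run_length(uint64_t hi, uint64_t lo)
{
    if (hi != lo)
        return 0;                        /* the period has to divide 64 */

    uint64_t v = lo;
    for (int width = 64; width >= 2; width /= 2) {
        int half = width / 2;
        uint64_t mask = ((uint64_t)1 << half) - 1;
        uint64_t high = (v >> half) & mask;
        uint64_t low = v & mask;

        if (high == low) {               /* same halves: keep halving */
            v = low;
            continue;
        }
        if (high != (~low & mask))
            return 0;                    /* neither same nor inverted: reject */

        /* Inverted halves: the period is `width`. The half must look like
           0..01..1 or 1..10..0, i.e. contain at most one transition. */
        uint64_t run = (low >> (half - 1)) ? (~low & mask) : low;
        if ((run & (run + 1)) != 0)
            return 0;                    /* run + 1 is not a power of two */
        return half;
    }
    return 0;                            /* all 128 bits are equal */
}
```

Since every step works on a value that fits in one register, this needs no table lookups, which matches the constraint about avoiding scattered memory access.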