Create a sequence which is ordered by bits set

Create a sequence which is ordered by bits set - c++

I'm looking for a reversible function unsigned f(unsigned) for which the number of bits set in f(i) increases with i, or at least does not decrease. Obviously, f(0) has to be 0 then, and f(~0) must come last. In between there's more flexibility. After f(0), the next 32* values must be 1U<<0 to 1U<<31, but I don't care a lot about the order (they all have 1 bit set).
I'd like an algorithm which doesn't need to calculate f(0)...f(i-1) in order to calculate f(i), and a complete table is also unworkable.
This is similar to Gray codes, but I can't see a way to reuse that algorithm. I'm trying to use this to label a large data set, and prioritize the order in which I search them. The idea is that I have a key C, and I'll check labels C ^ f(i). Low values of i should give me labels similar to C, i.e. differing in only a few bits.
[*] Bonus points for not assuming that unsigned has 32 bits.
[example]
A valid initial sequence:
0, 1, 2, 4, 16, 8 ... // 16 and 8 both have one bit set, so they compare equal
An invalid initial sequence:
0, 1, 2, 3, 4 ... // 3 has two bits set, so it cannot precede 4 or 2147483648.

Ok, seems like I have a reasonable answer. First let's define binom(n,k) as the number of ways in which we can set k out of n bits. That's the classic Pascal triangle:
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
...
Easily calculated and cached. Note that the sum of each line is 1<<lineNumber.
The next thing we'll need is the partial_sum of that triangle:
1 2
1 3 4
1 4 7 8
1 5 11 15 16
1 6 16 26 31 32
1 7 22 42 57 63 64
1 8 29 64 99 120 127 128
1 9 37 93 163 219 247 255 256
...
Again, this table can be created by summing two values from the previous line, except that the new entry on each line is now 1<<line instead of 1.
Let's use these tables above to construct f(x) for an 8 bits number (it trivially generalizes to any number of bits). f(0) still has to be 0. Looking up the 8th row in the first triangle, we see that next 8 entries are f(1) to f(9), all with one bit set. The next 28 entries (7+6+5+4+3+2+1) all have 2 bits set, so that's f(10) to f(37). The next 56 entries, f(38) to f(93) have 3 bits, and there are 70 entries with 4 bits set. From symmetry we can see that they're centered around f(128), in particular they're f(94) to f(163). And obviously, the only number with 8 bits set sorts last, as f(255).
So, with these tables we can quickly determine how many bits must be set in f(i). Just do a binary search in the last row of your table. But that doesn't answer exactly which bits are set. For that we need the previous rows.
The reason that each value in the table can be created from the previous line is simple. binom(n,k) == binom(k, n-1) + binom(k-1, n-1). There are two sorts of numbers with k bits set: Those that start with a 0... and numbers which start with 1.... In the first case, the next n-1 bits must contain those k bits, in the second case the next n-1 bits must contain only k-1 bits. Special cases are of course 0 out of n and n out of n.
This same stucture can be used to quickly tell us what f(16) must be. We already had established that it must contain 2 bits set, as it falls in the range f(10) - f(37). In particular, it's number 6 with 2 bits set (starting as usual with 0). It's useful to define this as an offset in a range as we'll try to shrink the length this range from 28 down to 1.
We now subdivide that range into 21 values which start with a zero and 7 which start a one. Since 6 < 21, we know that the first digit is a zero. Of the remaining 7 bits, still 2 need to be set, so we move up a line in the triangle and see that 15 values start with two zeroes, and 6 start with 01. Since 6 < 15, f(16) starts with 00. Going further up, 7 <= 10 so it starts with 000. But 6 == 6, so it doesn't start with 0000 but 0001. At this point we change the start of the range, so the new offset becomes 0 (6-6)
We know need can focus only on the numbers that start with 0001 and have one extra bit, which are f(16)...f(19). It should be obvious by know that the range is f(16)=00010001, f(17)=00010010, f(18)=00010100, f(19)=00011000.
So, to calculate each bit, we move one row up in the triangle, compare our "remainder", add a zero or one based on the comparison possibly go left one column. That means the computational complexity of f(x) is O(bits), or O(log N), and the storage needed is O(bits*bits).

For each given number k we know that there are binom(n, k) n-bit integers that have exactly k bits of value one. We can now generate a lookup table of n + 1 integers that store for each k how many numbers have less one bits. This lookup table can then be used to find the number o of one bits of f(i).
Once we know this number we subtract the lookup table value for this number of bits from i which leaves us with the permutation index p for numbers with the given number of one bits. Altough I have not done research in this area I am quite sure that there exists a method for finding the pth permutation of a std::vector<bool> which is initialized with zeros and o ones in the lowest bits.
The reverse function
Again the lookup table comes in handy. We can directly calculate the number of preceding numbers with less one bits by counting the one bits in the input integer and reading in the lookup table. Then you "only" need to determine the permutation index and add it to the looked up value and you are done.
Disclaimer
Of course this is only a rough outline and some parts (especially involving the permutations) might take longer than it sounds.
Addition
You stated yourself
I'm trying to use this to label a large data set, and prioritize the order in which I search them.
Which sounds to me as if you would be going from the low hamming distance to the high hamming distance. In this case it would be enough to have an incremental version which generates the next number from the previous:
unsigned next(unsigned previous)
{
if(std::next_permutation(previous))
return previous;
else
return (1 << (1 + countOneBits(previous))) - 1;
}
Of course std::next_permutation permutation does not work this way but I think it is clear how I mean to use it.

Related

Continuously multiply it by itself? such as 2^(100)

I want to print a list of numbers that show exponent of 2 to the 100th power such as 2,4,8,16,32,64,128.... up to the 100th power of two using a FOR loop and printing it. I basically want to multiply 2x2x2...100 times
for i in range(1,101,1):
print (i)
2
4
8
16
32
64
128
256
512
etc.. all the way to 2^(100)

Slightly faster
power=2
for i in range(1,101,1):
power*=2
print(power)

How to calculate the set bit positions in a number?

If n = 100011 in binary, then I want to retrieve the positions of set bits which in this case are 1,5,6 when measured from left to right.
How to calculate such positions without literally checking for bit is zero or not by going to every bit position?

In the most common convention, a binary number is written in the order as a number in other common positional representations (decimal etc): with the least significant digit in the rightmost position. It also makes more sense to label that digit as "digit zero", so that the label of every digit corresponds with the exponent in the associated weight (eg bit 0 has weight 20=1 and so forth). This doesn't really matter, it's easy enough to re-number the digits, but it's usually easier to follow the conventions.
Since you asked
How to calculate such positions without literally checking for bit is zero or not by going to every bit position?
I will address that portion of the question. Checking the bits one by one is not completely disastrous however. Even for BigInts. The number of results could be as high as the number of bits anyway. For numbers known to be sparse, there is still not much that can be done - every bit has to be checked somehow because if any bit is ignored completely, that bit might have been set and we'd miss it. But in the context of a machine word, there are tricks, for example based on find-first-set.
Using the find-first-set function (or count trailing zeroes), the index of the set bit with the lowest index can be found in one step (if you accept this function as being one step, which is a reasonable assumption on most hardware, and in theory you can just define it to be one step), and then that bit can be removed so the next find-first-set will find the index of the next bit. For example:
while bitmask != 0:
yield return find-first-set(bitmask)
bitmask &= bitmask - 1 // remove lowest set bit
This is easy to adapt to BigInts, just do this on every limb of the number and add the appropriate offset.

To do that you use masks.
Each position from right to left is a power of two.
For example 0101 is 1*2ˆ0 + 0*2ˆ1 + 1*2ˆ2 + 0*1ˆ3 = 1+0+4+0 = 5
Then to check if these two bits are on against a bytesToTest variable you AND with 5: byteToTest & 5 == 5
Given that 1 & 0 = 0 and 1 & 1 = 1
If bytesToTest is 1111 then 1111 & 0101 will give 0101
If bytesToTest is 1010 then 1010 & 0101 will give 0000
Following this reasoning for the particular case of 100011
To retrieve 1, 5, and 6 from left to right (the three ones set to 1)
The mask is: 1+2+32 = 35
With this information you should be able to define individual masks for each bit, test one by one, and be able to answer in which position you find bits that are on and in which bits that are off.

Minimum description length and Huffman coding for two symbols?

I am confused about the interpretation of the minimum description length of an alphabet of two symbols.
To be more concrete, suppose that we want to encode a binary string where 1's occur with probability 0.80; for instance, here is a string of length 40, with 32 1's and 8 0's:
1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1
Following standard MDL analysis, we can encode this string using prefix codes (like Huffman's) and the code of encoding this string would be (-log(0.8) * 32 - log(0.2) * 8), which is lower than duplicating the string without any encoding.
Intuitively, it is "cheaper" to encode this string than some string where 1's and 0's occur with equal probability. However, in practice, I don't see why this would be the case. At the very least, we need one bit to distinguish between 1's and 0's. I don't see how prefix codes could do better than just writing the binary string without encoding.
Can someone help me clarify this, please?

I don't see how prefix codes could do better than just writing the
binary string without encoding.
You can't with prefix codes, unless you combine bits to make more symbols. For example, if you code every two bits, you now have four symbols with probabilities 0.64, 0.16, 0.16, and 0.04. That would be coded with 0, 10, 110, 111. That gives an average of 1.56 bits per symbol, or 0.7800 bits per original bit. We're getting a little closer to the optimal 0.7219 bits per bit (-0.2 log20.2 - 0.8 log20.8).
Do that for three-bit groupings, and you get 0.7280 bits per bit. Surprisingly close to the optimum. In this case, the code lengths just happen to group really nicely with the probabilities. The code is 1 bit (0) for the symbol with probability 0.512, 3 bits (100, 101, 110) for the three symbols with probability 0.128, and 5 bits (11100, 11101, 11110, 11111) for both the three symbols with probability 0.032 and the one symbol with probability 0.008.
You can keep going and get asymptotically close to the optimal 0.7219 bits per bit. Though it becomes more inefficient in time and space for larger groupings. The Pareto Front turns out to be at multiples of three bits up through 15. 6 bits gives 0.7252 bits per bit, 9 gives 0.7251, 12 is 0.7250, and 15 is 0.7249. The approach is monumentally slow, where you need to go to 28 bits to get to 0.7221. So you might as well stop at 6. Or even just 3 is pretty good.
Alternatively you can use something other than prefix coding, such as arithmetic coding, range coding, or asymmetric numeral system coding. They effectively use fractional bits for each symbol.

Huffman code with lookup table

I have read an article on Internet and know that the natural way of decoding by traversing from root but I want to do it faster with a lookup table.
After reading, I still cannot get the points.
ex:
input:"abcdaabaaabaaa"
code data
0 a
10 b
110 c
111 d
The article says that due to variable length, it determine the length by taking a bit of string of
max code length and use it as index.
output:"010110111001000010000"
Index Index(binary) Code Bits required
0 000 a 1
1 001 a 1
2 010 a 1
3 011 a 1
4 100 b 2
5 101 b 2
6 110 c 3
7 111 d 3
My questions are:
What does it means due to variable length, it determine the length by taking a bit of string of
max code length? How to determine the length?
How to generate the lookup table and how to use it? What is the algorithm behind?

For your example, the maximum code length is 3 bits. So you take the first 3 bits from your stream (010) and use that to index the table. This gives code, 'a' and bits = 1. You consume 1 bit from your input stream, output the code, and carry on. On the second go around you will get (101), which indexes as 'b' and 2 bits, etc.
To build the table, make it as large as 1 << max_code_length, and fill in details as if you are decoding the index as a huffman code. If you look at your example all the indices which begin '0' are a, indices beginning '10' are b, and so on.

Prime backwards in order

An emirp (prime spelled backwards) is a pime number whose reversal is also prime. Ex. 17 & 71. I have to write a program that displays the first 100 emirps. It has to display 10 numbers per line and align the numbers properly:
2 3 5 7 11 13 17 31 37 71
73 79 97 101 107 113 131 149 151 157.
I have no cue what I am doing and would love if anyone could dump this down for me.

It sounds like there are two general problems:
Finding the emirps.
Formatting the output as required.
Break down your tasks into smaller parts, and then you'll be able to see more clearly how to get the whole task done.
To find the emirps, first write some helper functions:
is_prime() to determine whether a number is prime or not
reverse_digits() to reverse the digits of any number
Combining these two functions, you can imagine a loop that finds all the numbers that are primes both forward and reversed. Your first task is complete when you can simply generate a list of those numbers, printing them to the console one per line.
Next, work out what format you want to use (it looks like a fixed format of some number of character spaces per number is what you need). You know that you have 100 numbers, 10 per line, so working out how to format the numbers should be straightforward.

Break the problem down into simpler sub-problems:
Firstly, you need to check whether a number is prime. This is such a common task that you can easily Google it - or try a naive implementation yourself, which may be better given that this is homework.
Secondly, you need to reverse the digits of a number. I'd suggest you figure out an algorithm for this on a piece of paper first, then implement it in code.
Put the two together - it shouldn't be that hard.
Format the results properly. Printing 10 numbers per line is something you should be able to figure out easily once the rest is done.
Once you have a simple version working you might be able to optimise it in some way.

A straight forward way of checking if a number is prime is by trying all known primes less than it and seeing if it divides evenly into that number.
Example: To find the first couple of primes
Start off with the number 2, it is prime because the only divisors are itself and 1, meaning the only way to multiple two numbers to get 2 is 2 x 1. Likewise for 3.
So that starts us off with two known primes 2 and 3. To check the next number we can check if 4 modulo 2 equals 0. This means when divide 2 into 4 there is no remainder, which means 2 is a factor of 4. Specifically we know 2 x 2 = 4. Thus 4 is not prime.
Moving on to the next number: 5. To check five we try 5 modulo 2 and 5 modulo 3, both of which equals one. So 5 is prime, add it to our list of known primes and then continue on checking the next number. This rather tedious process is great for a computer.
So on and so forth - check the next number by looping through all previous found primes and check if they divide evenly, if all previously found primes don't divide evenly, you have a new prime. Repeat.
You can speed this up by counting by 2's, since all even numbers are divisible by two. Also, another nice trick is you don't have to check any primes greater than the square root of the number, since anything larger would need a smaller prime factor. Cuts your loops in half.
So that is your algorithm to generate a large list of primes.
Collect a good chunk of them in an array, say the first 10,000 or so. And then loop through them, reverse the numbers and see if the result is in your array. If so you have a emirp, continue until you get the first 100 emirps
If the first 10,000 primes don't return 100 emirps. Move on to the next 10,000. Repeat.

For homework, I would use a fairly simplistic isPrime function, pseudo-code along the lines of:
def isPrime (num):
set testDiv1 to 2
while testDiv1 multiplied by testDiv1 is less than or equal to num:
testDiv2 = integer part of (num divided by testDiv1)
if testDiv1 multiplied by testDiv2 is equal to num:
return true
Add 1 to testDiv1
return false
This basically checks whether the number is evenly divisible by any number between 2 and the square root of the number, a primitive primality check. The reson you stop at the square root is because you would have already found a match below it if there was one above it.
For example 100 is 2 times 50, 4 times 25, 5 time 20 and 10 times 10. The next one after that would be 20 times 5 but you don't need to check 20 since it would have been found when you checked 5. Any positive number can be expressed as a product of two other positive numbers, one below the square root and one above (other than the exact square root case of course).
The next tricky bit is the reversal of digits. C has some nice features which will make this easier for you, the pseudo-code is basically:
def reverseDigits (num):
set newNum to zero
while num is not equal to zero:
multiply newnum by ten
add (num modulo ten) to newnum
set num to the integer part of (num divided by ten)
return newNum
In C, you can use int() for integer parts and % for the modulo operator (what's left over when you divide something by something else - like 47 % 10 is 7, 9 % 4 is 1, 1000 % 10 is 0 and so on).
The isEmirp will be a fairly simplistic:
def isEmirp (num):
if not isPrime (num):
return false
num2 = reverseDigits (num)
if not isPrime (num2):
return false
return true
Then at the top level, your code will look something like:
def mainProg:
create array of twenty emirps
set currEmirp to zero
set tryNum to two
while currEmirp is less than twenty
if isEmirp (tryNum):
put tryNum into emirps array at position currEmirp
add 1 to currEmirp
for currEmirp ranging from 0 to 9:
print formatted emirps array at position currEmirp
print new line
for currEmirp ranging from 10 to 19:
print formatted emirps array at position currEmirp
print new line
Right, you should be able to get some usable code out of that, I hope. If you have any questions of the translation, leave a comment and I'll provide pointers for you, rather than solving it or doing the actual work.
You'll learn a great deal more if you try yourself, even if you have a lot of trouble initially.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js