Optimal way to compress a 60-bit string

Given 15 random hexadecimal digits (60 bits), where there is always at least one duplicate in every 20-bit run (5 hexadecimal digits).
What is the optimal way to compress the bytes?
Here are some examples:
01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130
Initially I tried Huffman encoding on just 20 bits, because Huffman coding can go from 20 bits down to ~10 bits, but storing the table takes more than 9 bits.
Here is the breakdown showing 20 bits -> 10 bits for 01230:
Character Frequency Assignment Space Savings
---------------------------------------------
0         2         0          2×4 - 2×1 = 6 bits
2         1         10         1×4 - 1×2 = 2 bits
1         1         110        1×4 - 1×3 = 1 bit
3         1         111        1×4 - 1×3 = 1 bit
I then tried Huffman encoding on all 300 bits (the five 60-bit runs above), which gives this mapping:
Character Frequency Assignment Space Savings
---------------------------------------------
a         10        101        10×4 - 10×3 = 10 bits
9         8         000        8×4 - 8×3 = 8 bits
2         7         1111       7×4 - 7×4 = 0 bits
3         6         1101       6×4 - 6×4 = 0 bits
0         5         1100       5×4 - 5×4 = 0 bits
5         5         1001       5×4 - 5×4 = 0 bits
1         4         0010       4×4 - 4×4 = 0 bits
8         4         0111       4×4 - 4×4 = 0 bits
d         4         0101       4×4 - 4×4 = 0 bits
f         4         0110       4×4 - 4×4 = 0 bits
c         4         1000       4×4 - 4×4 = 0 bits
b         4         0011       4×4 - 4×4 = 0 bits
6         3         11100      3×4 - 3×5 = -3 bits
e         3         11101      3×4 - 3×5 = -3 bits
4         2         01000      2×4 - 2×5 = -2 bits
7         2         01001      2×4 - 2×5 = -2 bits
This yields a savings of 8 bits overall, but 8 bits isn't enough to store the Huffman table. It seems that, because of the randomness of the data, the more bits you try to encode with Huffman, the less effective it is. Huffman encoding seemed to work best with 20 bits (50% reduction), but storing the table in 9 bits or fewer isn't possible AFAIK.
In the worst case for a 60-bit string there are still at least 3 duplicates; in the average case there are more than 3 (my assumption). With at least 3 duplicates, the most distinct symbols you can have in a 60-bit run is just 12.
Because of the duplicates, plus having fewer than 16 symbols, I can't help but feel like there is some type of compression that can be used.

If I simply count the number of 20-bit values with at least two hexadecimal digits equal, there are 524,416 of them. A smidge more than 2^19. So the most you could possibly save is a little less than one bit out of the 20.
Hardly seems worth it.
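For what it's worth, that count is easy to verify: 16^5 - 16·15·14·13·12 = 1,048,576 - 524,160 = 524,416. A brute-force check in C++ (a quick sketch of my own, not part of the argument above) agrees:
#include <cstdio>

int main() {
    long count = 0;
    for (long v = 0; v < 16L * 16 * 16 * 16 * 16; ++v) {  // all 16^5 values
        int d[5];                                // the five hex digits of v
        long x = v;
        for (int i = 0; i < 5; ++i) { d[i] = x & 0xF; x >>= 4; }
        bool dup = false;
        for (int i = 0; i < 5 && !dup; ++i)      // any pair of equal digits?
            for (int j = i + 1; j < 5; ++j)
                if (d[i] == d[j]) { dup = true; break; }
        if (dup) ++count;
    }
    std::printf("%ld\n", count);                 // prints 524416
}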

If I split your question into two parts:
How do I compress (perfectly) random data: You can't. Every bit is new entropy which can't be "guessed" by a compression algorithm.
How to compress "one duplicate in five characters": There are exactly 10 options for where the duplicate can be (see table below). This is basically the entropy. Just store which option it is (maybe grouped for the whole line).
These are the options:
AAbcd = 1 AbAcd = 2 AbcAd = 3 AbcdA = 4 (<-- cases where first character is duplicated somewhere)
aBBcd = 5 aBcBd = 6 aBcdB = 7 (<-- cases where second character is duplicated somewhere)
abCCd = 8 abCdC = 9 (<-- cases where third character is duplicated somewhere)
abcDD = 0 (<-- case where the last two characters are duplicated)
So for your first example:
01230 45647 789AA
The first one (01230) is option 4, the second option 3, and the third option 0.
You can compress this by multiplying each consecutive option by 10: (4*10 + 3)*10 + 0 = 430
And uncompress it by using divide and modulo: 430%10=0, (430/10)%10=3, (430/10/10)%10=4. So you could store your number like that:
1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bits
The maximum value for the three options combined is 999, so 10 bits are enough.
Compared to storing these 3 characters normally you save 2 bits. As someone else already commented, this is probably not worth it. For the whole line it's even less: 2 bits / 60 bits ≈ 3.3% saved.
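Here is how such an encoder might look in C++ (a sketch of my own; encodeGroup is a made-up name). It finds the first duplicated pair in a 5-character group, maps it to the option digit from the table above, and drops the second occurrence of the duplicate:
#include <cstdio>
#include <string>

int encodeGroup(const std::string& g, std::string& rest) {
    int option = 0;
    for (int i = 0; i < 5; ++i)
        for (int j = i + 1; j < 5; ++j) {
            ++option;                                    // pairs enumerated in the table's order
            if (g[i] == g[j]) {
                rest = g.substr(0, j) + g.substr(j + 1); // remove the duplicate at j
                return option % 10;                      // the table stores option 10 as 0
            }
        }
    return -1;  // no duplicate: input violates the stated guarantee
}

int main() {
    const std::string groups[3] = {"01230", "45647", "789AA"};
    int packed = 0;
    for (const std::string& g : groups) {
        std::string rest;
        int opt = encodeGroup(g, rest);
        std::printf("%s -> option %d, remainder %s\n", g.c_str(), opt, rest.c_str());
        packed = packed * 10 + opt;                      // (4*10 + 3)*10 + 0 = 430
    }
    std::printf("packed = %d = 0x%X\n", packed, packed); // 430 = 0x1AE
}
Decoding reverses this: unpack the three option digits with divide and modulo, then re-insert each duplicate at the position its option encodes.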

If you want to get rid of the duplicates first, do this, then look at the links below. If you don't want to get rid of the duplicates, then still look at the links below:
// Linear search for a value in the array
Array.prototype.contains = function(v) {
    for (var i = 0; i < this.length; i++) {
        if (this[i] === v) return true;
    }
    return false;
};

// Build a new array containing only the first occurrence of each value
Array.prototype.unique = function() {
    var arr = [];
    for (var i = 0; i < this.length; i++) {
        if (!arr.contains(this[i])) {
            arr.push(this[i]);
        }
    }
    return arr;
};
var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1,3,4,2,8]
console.log(uniques);
That shortens the data you have to deal with. Then you might want to check out Smaz:
Smaz is a simple compression library suitable for compressing strings.
If that doesn't work, then you could take a look at this:
http://ed-von-schleck.github.io/shoco/
Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for English words, but you can generate your own compression model based on your specific input data.
Let me know if it works!

Related

Using AND bitwise operator between a number, and its negative counterpart

I stumbled upon this simple line of code, and I cannot figure out what it does. I understand what it does in separate parts, but I don't really understand it as a whole.
// We have an integer(32 bit signed) called i
// The following code snippet is inside a for loop declaration
// in place of a simple incrementor like i++
// for(;;HERE){}
i += (i&(-i))
If I understand correctly, it uses the AND binary operator between i and negative i and then adds that number to i. I first thought that this would be an optimized way of calculating the absolute value of an integer; however, as I have come to know, C++ does not store negative integers simply by flipping a bit, but please correct me if I'm wrong.
Assuming two's complement representation, and assuming i is not INT_MIN, the expression i & -i results in the value of the lowest bit set in i.
If we look at the value of this expression for various values of i:
 0 00000000: i&(-i) = 0
 1 00000001: i&(-i) = 1
 2 00000010: i&(-i) = 2
 3 00000011: i&(-i) = 1
 4 00000100: i&(-i) = 4
 5 00000101: i&(-i) = 1
 6 00000110: i&(-i) = 2
 7 00000111: i&(-i) = 1
 8 00001000: i&(-i) = 8
 9 00001001: i&(-i) = 1
10 00001010: i&(-i) = 2
11 00001011: i&(-i) = 1
12 00001100: i&(-i) = 4
13 00001101: i&(-i) = 1
14 00001110: i&(-i) = 2
15 00001111: i&(-i) = 1
16 00010000: i&(-i) = 16
The pattern is clear: the result is always the lowest set bit, which is itself a power of two.
Extrapolating that to i += (i&(-i)), assuming i is positive, it adds the value of the lowest set bit to i. For values that are a power of two, this just doubles the number.
For other values, it rounds the number up by the value of that lowest bit. Repeating this in a loop, you eventually end up with a power of 2. As for what such an increment could be used for, that depends on the context of where this expression was used.
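For example (a tiny demo of my own), starting from a positive value that is not a power of two:
#include <cstdio>

int main() {
    int i = 5;                           // binary 101
    while (i & (i - 1)) {                // true while more than one bit is set
        std::printf("%d += %d\n", i, i & -i);
        i += i & -i;                     // add the lowest set bit
    }
    std::printf("result: %d\n", i);      // 5 -> 6 -> 8, a power of two
}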

Downscale array for decimal factor

Is there an efficient way to downscale the number of elements in an array by a decimal (non-integer) factor?
I want to downsize the elements of one array by a certain factor.
Example:
If I have 10 elements and need to scale down by a factor of 2:
1 2 3 4 5 6 7 8 9 10
scaled to
1.5 3.5 5.5 7.5 9.5
This groups the elements 2 by 2 and takes the arithmetic mean.
My problem is: what if I need to downsize an array with 10 elements to 6 elements? In theory I should group 10/6 ≈ 1.67 elements and find their arithmetic mean, but how do I do that?
Before suggesting a solution, let's define "downsize" in a more formal way. I would suggest this definition:
Downsizing starts with an array a[N] and produces an array b[M] such that the following is true:
M <= N - otherwise it would be upsizing, not downsizing
SUM(b) = (M/N) * SUM(a) - The sum is reduced proportionally to the number of elements
Elements of a participate in computation of b in the order of their occurrence in a
Let's consider your example of downsizing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 to six elements. The total for your array is 55, so the total for the new array would be (6/10)*55 = 33. We can achieve this total in two steps:
Walk the array a, totaling its elements, until we've reached the integer part of the N/M fraction (it must be an improper fraction by rule 1 above)
Let's say that a[i] was the last element of a that we could take as a whole in the current iteration. Take the fraction of a[i+1] equal to the fractional part of N/M
Continue to the next number, starting with the remaining fraction of a[i+1]
Once you are done, your array b would contain M numbers totaling SUM(a). Walk the array once more, and divide each element by N/M.
Here is how it works with your example:
b[0] = a[0] + (2/3)*a[1] = 2.33333
b[1] = (1/3)*a[1] + a[2] + (1/3)*a[3] = 5
b[2] = (2/3)*a[3] + a[4] = 7.66666
b[3] = a[5] + (2/3)*a[6] = 10.6666
b[4] = (1/3)*a[6] + a[7] + (1/3)*a[8] = 13.3333
b[5] = (2/3)*a[8] + a[9] = 16
--------
Total = 55
Scaling down by 6/10 produces the final result:
1.4 3 4.6 6.4 8 9.6 (Total = 33)
Here is a simple implementation in C++:
// Assumes b has been sized to M elements and zero-initialized.
double need = ((double)a.size()) / b.size();  // N/M: input elements per output element
double have = 0;
size_t pos = 0;
for (size_t i = 0; i != a.size(); i++) {
    if (need >= have + 1) {
        b[pos] += a[i];            // a[i] fits entirely into the current element of b
        have++;
    } else {
        double frac = need - have; // frac is less than 1 because of the "if" condition
        b[pos++] += frac * a[i];   // frac of a[i] goes to the current element of b
        have = 1 - frac;
        b[pos] += have * a[i];     // (1-frac) of a[i] goes to the next position of b
    }
}
for (size_t i = 0; i != b.size(); i++) {
    b[i] /= need;                  // scale the totals down to averages
}
You will need to resort to some form of interpolation, as the number of elements to average isn't an integer.
You can consider computing the prefix sum of the array, i.e.
0 1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 10
yields by summation
0 1 2 3 4 5 6 7 8 9
1 3 6 10 15 21 28 36 45 55
Then perform linear interpolation to get the intermediate values that you are lacking, like at 0*, 10/6, 20/6, 30/6*, 40/6, 50/6, 60/6*. (Those with an asterisk are readily available.)
0 1 10/6 2 3 20/6 4 5 6 40/6 7 8 50/6 9
1 3 15/3 6 10 35/3 15 21 28 100/3 36 45 145/3 55
Now you get fractional sums by subtracting values in pairs. The first average is
(15/3-1)/(10/6) = 12/5
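Here is a sketch of this prefix-sum approach in C++ (my own code, not from the answer above). It uses exclusive prefix sums, P[i] = a[0] + ... + a[i-1], which makes the window arithmetic uniform and reproduces the 1.4 3 4.6 6.4 8 9.6 result of the other answer:
#include <cstdio>
#include <vector>

std::vector<double> downsample(const std::vector<double>& a, std::size_t m) {
    std::size_t n = a.size();
    std::vector<double> prefix(n + 1, 0.0);     // prefix[i] = a[0] + ... + a[i-1]
    for (std::size_t i = 0; i < n; ++i)
        prefix[i + 1] = prefix[i] + a[i];

    // Prefix sum at a fractional position x, by linear interpolation.
    auto P = [&](double x) {
        std::size_t i = (std::size_t)x;
        if (i >= n) return prefix[n];
        return prefix[i] + (x - i) * a[i];
    };

    std::vector<double> b(m);
    double step = (double)n / m;                // window width, e.g. 10/6
    for (std::size_t k = 0; k < m; ++k)
        b[k] = (P((k + 1) * step) - P(k * step)) / step;  // average over the window
    return b;
}

int main() {
    std::vector<double> a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    for (double v : downsample(a, 6))
        std::printf("%g ", v);                  // prints 1.4 3 4.6 6.4 8 9.6
    std::printf("\n");
}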
I can't think of anything in the C++ library that will crank out something like this, all fully cooked and ready to go.
So you'll have to, pretty much, roll up your sleeves and go to work. At this point, the question of what's the "efficient" way of doing it boils down to its very basics. Which means:
1) Calculate how big the output array should be. Based on the description of the issue, you should be able to make that calculation even before looking at the values in the input array. You know the input array's size(), you can calculate the size() of the destination array.
2) So, you resize() the destination array up front. Now, you no longer need to worry about the time wasted in growing the size of the dynamic output array, incrementally, as you go through the input array, making your calculations.
3) So what's left is the actual work: iterating over the input array, and calculating the downsized values.
auto b = input_array.begin();
auto e = input_array.end();
auto p = output_array.begin();
Don't see many other options here, besides brute force iteration and calculations. Iterate from b to e, getting your samples, calculating each downsized value, and saving the resulting value into *p++.

How does this implementation of bitset::count() work?

Here's the implementation of std::bitset::count with MSVC 2010:
size_t count() const
{   // count number of set bits
    static char _Bitsperhex[] = "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4";
    size_t _Val = 0;
    for (int _Wpos = _Words; 0 <= _Wpos; --_Wpos)
        for (_Ty _Wordval = _Array[_Wpos]; _Wordval != 0; _Wordval >>= 4)
            _Val += _Bitsperhex[_Wordval & 0xF];
    return (_Val);
}
Can someone explain to me how this works? What's the trick with _Bitsperhex?
_Bitsperhex contains the number of set bits in a hexadecimal digit, indexed by the digit.
digit: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
value:    0    1    1    2    1    2    2    3    1    2    2    3    2    3    3    4
index:    0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
The function retrieves one digit at a time from the value it's working with by ANDing with 0xF (binary 1111), looks up the number of set bits in that digit, and sums them.
_Bitsperhex is a 16-element array that maps a number in the [0..15] range to the number of 1 bits in the binary representation of that number. For example, _Bitsperhex[3] is equal to 2, which is the number of 1 bits in the binary representation of 3.
The rest is easy: each multi-bit word in internal array _Array is interpreted as a sequence of 4-bit values. Each 4-bit value is fed through the above _Bitsperhex table to count the bits.
It is a slightly different implementation of the lookup table-based method described here: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable. At the link they use a table of 256 elements and split 32-bit words into four 8-bit values.
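To illustrate, here is the same nibble-lookup trick applied to a plain 32-bit integer (a standalone sketch of my own, not the MSVC source):
#include <cstdint>
#include <cstdio>

static unsigned popcount_nibble(std::uint32_t v) {
    static const char kBitsPerHex[] = "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4";
    unsigned count = 0;
    for (; v != 0; v >>= 4)                        // consume one hex digit per iteration
        count += kBitsPerHex[v & 0xF];
    return count;
}

int main() {
    std::printf("%u\n", popcount_nibble(0xF00Du)); // 1111 0000 0000 1101 -> prints 7
}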

Find rank of a number on basis of number of 1's

Let f(k) = y where k is the y-th number in the increasing sequence of non-negative integers with the same number of ones in its binary representation as k, e.g. f(0) = 1, f(1) = 1, f(2) = 2, f(3) = 1, f(4) = 3, f(5) = 2, f(6) = 3 and so on. Given k >= 0, compute f(k).
Many of us have seen this question.
One solution is to categorize numbers on the basis of their number of 1s and then find the rank. I did find some patterns going this way, but it would be a lengthy process. Can anyone suggest a better solution?
This is a counting problem. I think that if you approach it with this in mind, you can do much better than literally enumerating values and checking how many bits they have.
Consider the number 17. The binary representation is 10001. The number of 1s is 2. We can get smaller numbers with two 1s by (in this case) re-distributing the 1s to any of the four low-order bits. 4 choose 2 is 6, so 17 should be the 7th number with 2 ones in the binary representation. We can check this...
0 00000 -
1 00001 -
2 00010 -
3 00011 1
4 00100 -
5 00101 2
6 00110 3
7 00111 -
8 01000 -
9 01001 4
10 01010 5
11 01011 -
12 01100 6
13 01101 -
14 01110 -
15 01111 -
16 10000 -
17 10001 7
And we were right. Generalize that idea and you should get an efficient function for which you simply compute the rank of k.
EDIT: Hint for generalization
17 is special in that if you don't consider the high-order bit, the number has rank 1; that is, f(z) = 1 where z is everything except the high-order bit. For numbers where this is not the case, how can you account for the fact that you can get smaller numbers without moving the high-order bit?
f(k) counts the integers less than or equal to k that have the same number of ones in their binary representation as k.
Suppose k needs m bits; that is, k = 2^(m-1) + a, where a < 2^(m-1). The number of integers less than 2^(m-1) that have the same number of one bits as k is choose(m-1, bitcount(k)), since you can freely redistribute the ones among the m-1 least significant bits.
Integers that are greater than or equal to 2^(m-1) have the same most significant bit as k (which is 1), so there are f(k - 2^(m-1)) of them. This implies f(k) = choose(m-1, bitcount(k)) + f(k-2^(m-1)).
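That recurrence translates almost directly into code. A sketch (helper names are my own; uses C++20 <bit> for popcount and bit_width):
#include <bit>
#include <cstdint>
#include <cstdio>

// C(n, r), computed so that every intermediate division is exact.
static std::uint64_t choose(unsigned n, unsigned r) {
    if (r > n) return 0;
    std::uint64_t c = 1;
    for (unsigned i = 1; i <= r; ++i)
        c = c * (n - r + i) / i;
    return c;
}

// f(k) = choose(m-1, bitcount(k)) + f(k - 2^(m-1)), with f(0) = 1.
static std::uint64_t f(std::uint64_t k) {
    if (k == 0) return 1;                        // 0 is the 1st number with zero ones
    unsigned m = (unsigned)std::bit_width(k);    // k needs m bits: k = 2^(m-1) + a
    unsigned ones = (unsigned)std::popcount(k);
    return choose(m - 1, ones) + f(k - (std::uint64_t{1} << (m - 1)));
}

int main() {
    std::printf("f(17) = %llu\n", (unsigned long long)f(17)); // prints 7, matching the table above
}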
See "Efficiently Enumerating the Subsets of a Set". Look at Table 3, the "Bankers sequence". This is a method to generate exactly the sequence you need (if you reverse the bit order). Just run K iterations for the word with K bits. There is code to generate it included in the paper.

Bitfields in C++

I have the following code for self learning:
#include <iostream>
using namespace std;

struct bitfields{
    unsigned field1: 3;
    unsigned field2: 4;
    unsigned int k: 4;
};

int main(){
    bitfields field;
    field.field1=8;
    field.field2=17;
    field.k=18;
    cout<<field.k<<endl;
    cout<<field.field1<<endl;
    cout<<field.field2<<endl;
    return 0;
}
I know that unsigned int k:4 means that k is 4 bits wide, with a maximum value of 15, and the result is the following.
2
0
1
For example, field1 can be from 0 to 7 (inclusive), and field2 and k from 0 to 15. Why such a result? Shouldn't it be all zeros?
You're overflowing your fields. Let's take k as an example: it's 4 bits wide. It can hold values, as you say, from 0 to 15; in binary representation this is
0 -> 0000
1 -> 0001
2 -> 0010
3 -> 0011
...
14 -> 1110
15 -> 1111
So when you assign 18, having binary representation
18 -> 1 0010 (space added between 4th and 5th bit for clarity)
k can only hold the lower four bits, so
k = 0010 = 2.
The equivalent holds true for the rest of your fields as well.
You have these results because the assignments overflowed each bitfield.
The variable field1 is 3 bits, but 8 takes 4 bits to represent (1000). The lower three bits are all zero, so field1 is zero.
For field2, 17 is represented by 10001, but field2 is only four bits. The lower four bits represent the value 1.
Finally, for k, 18 is represented by 10010, but k is only four bits. The lower four bits represent the value 2.
I hope that helps clear things up.
In C++ any unsigned type wraps around when you hit its ceiling[1]. When you define a bitfield of 4 bits, then every value you store is wrapped around too. The possible values for a bitfield of size 4 are 0-15. If you store '17', then you wrap to '1', for '18' you go one more to '2'.
Mathematically, the wrapped value is the original value modulo the number of possible values for the destination type:
For the bitfield of size 4 (2**4 possible values):
18 % 16 == 2
17 % 16 == 1
For the bitfield of size 3 (2**3 possible values):
8 % 8 == 0.
[1] This is not true for signed types, where overflow is undefined behavior.
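As a quick demonstration of that modulo rule (my own example, mirroring the fields from the question):
#include <cstdio>

struct Fields {
    unsigned field1 : 3;   // holds 0..7
    unsigned field2 : 4;   // holds 0..15
    unsigned k      : 4;   // holds 0..15
};

int main() {
    Fields f;
    f.field1 = 8;          //  8 % 8  == 0
    f.field2 = 17;         // 17 % 16 == 1
    f.k      = 18;         // 18 % 16 == 2
    std::printf("%u %u %u\n",
                (unsigned)f.field1, (unsigned)f.field2, (unsigned)f.k); // prints: 0 1 2
}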