Compression of sorted data with small differences

I have a sorted sequence of integers. The maximal difference between two consecutive numbers is 3. So the data looks, for example, like this:
Data: 1 2 3 5 7 8 9 10 13 14
Differences: (start 1) 1 1 2 2 1 1 1 3 1
Is there a better way to store (compress) this type of sequence than saving the difference values? Dictionary-based methods fail to compress it because of the apparent randomness of the values 1, 2, and 3. A "PAQ"-style compressor gives better results, but still not quite satisfying ones. Huffman and arithmetic coders do worse than the dictionary-based methods.
Is there some way to use prediction?
For example, fit a regression to the original data and then store the differences from it (which could be smaller or more consistent).
Or use some kind of prediction based on a histogram of the differences?
Or something totally different... or is it not possible at all (which is, in my opinion, the real answer :))

Since you say in the comments that you're already storing four differences per byte, you're unlikely to do much better. If the differences 0, 1, 2, and 3 were random and evenly distributed, then there would be no way to do better.
If they are not evenly distributed, then you might be able to do better with a Huffman or arithmetic code. E.g. if 1 is more common than 0, which is more common than 2 and 3, then you could store 1 as 0, 0 as 10, 2 as 110, and 3 as 111. Or if 0 never happens, 1 as 0, 2 and 3 as 10 and 11. You could do better with an arithmetic code for the case you quote where 1 occurs 80% of the time. Or a poor man's arithmetic code by coding pairs of symbols. E.g.:
11 0
13 100
21 101
12 110
31 1110
22 111100
23 111101
32 111110
33 111111
would be a good code for 1 80%, 2 10%, 3 10%. (That doesn't quite handle the case of an odd number of differences, but you could deal with that with just a bit at the start indicating an even or odd number, and a few more bits at the end if odd.)
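To make the pair coding concrete, here is a minimal sketch in C++ (the function name is invented, and it emits the code as a '0'/'1' string for clarity; a real encoder would pack the bits):

#include <cstddef>
#include <string>
#include <vector>

// Encode a stream of differences (values 1..3) two at a time using the
// code table above. Output is a string of '0'/'1' characters for clarity.
std::string encodePairs(const std::vector<int>& diffs) {
    // codes[a-1][b-1] is the code for the pair (a, b), copied from the table
    static const char* codes[3][3] = {
        {"0",    "110",    "100"},     // 11, 12, 13
        {"101",  "111100", "111101"},  // 21, 22, 23
        {"1110", "111110", "111111"},  // 31, 32, 33
    };
    std::string out;
    for (std::size_t i = 0; i + 1 < diffs.size(); i += 2)
        out += codes[diffs[i] - 1][diffs[i + 1] - 1];
    // an odd trailing difference would need the extra handling described above
    return out;
}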
There might be a better predictor than the previous value: a function of the n previous values instead of just one. However, this would be highly data-dependent. For example, you could assume that the current value is likely to fall on the line made by the previous two values. Or that it falls on the parabola made by the previous three values. Or some other function, e.g. a sinusoid with some frequency, if the data is biased that way.
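For the linear case, a sketch of replacing each value with its residual against the line through the previous two values (the function name is made up, and the treatment of the first two samples is one arbitrary choice among several):

#include <cstddef>
#include <vector>

// Predict each value as 2*prev - prev2 (the line through the previous two
// points) and store only the residuals, which are ideally small and biased.
std::vector<int> residuals(const std::vector<int>& data) {
    std::vector<int> out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        int pred = (i == 0) ? 0
                 : (i == 1) ? data[0]
                 : 2 * data[i - 1] - data[i - 2];
        out.push_back(data[i] - pred);
    }
    return out;
}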

Related

How to compress a time series where the only values are 1, 0, and -1

I am trying to efficiently store a huge number (> 1 billion) of time series. Each value can only be 1, 0, or -1, and the value is recorded once a minute for 40,000 minutes.
I realize that each minute the value can be stored in 2 bits, but I think there is an easier way: there are a limited number of permutations for any time period, so I could just assign a number to each permutation instead of recording all the bits.
For example, if I were to take a 16 minute period: to record those values would require (16 x 2 bits) = 32 bits = 4 bytes. But presumably, I can cut that number in half (or more) if I simply assign a number to each of the possible permutations.
My question: what is the formula for determining the number of permutations for 16 values? I know how to calculate it if the values can be any number, but am stumped as to how to do it when there are just 3 values.
For instance, you can just zip the file; with only 3 symbols you will get a good compression ratio.
If you want to do the hard work yourself, you can do what basic zip algorithms do:
You have 3 values: -1, 0, and 1.
Then you can define a translation tree like:
bit sequence - symbol
0 - 0
10 - 1
110 - -1
1110 - End of data
So if you read a zero you know it is a 0 symbol; if you read a 1, you have to read the next bit to know whether it is a 1, or whether you have to read one more bit to know it is a -1.
So if you have a series 1,1,0,-1,0 it would translate as:
101001100
If this is all the data, you have 9 bits, so you need to pad with something to get to 16.
Then just put an end-of-data marker and after that anything.
10100110 01110000
To do this you need to work with bit operators.
If you know that one of these symbols occurs more often than the rest, give it the shortest code (for example, 0 should represent the most frequently used symbol).
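A minimal sketch of this encoder in C++ (names are invented for illustration; bits are written MSB-first and the final byte is padded with zeros, as in the example above):

#include <cstdint>
#include <vector>

// 0 -> "0", 1 -> "10", -1 -> "110", end-of-data -> "1110".
std::vector<uint8_t> encodeTernary(const std::vector<int>& samples) {
    std::vector<uint8_t> bytes;
    unsigned used = 0;                       // bits used in the last byte
    auto putBit = [&](int b) {
        if (used == 0) bytes.push_back(0);   // start a fresh byte
        bytes.back() |= uint8_t(b << (7 - used));
        used = (used + 1) & 7;
    };
    auto putCode = [&](int ones) {           // 'ones' 1-bits, then a 0
        while (ones-- > 0) putBit(1);
        putBit(0);
    };
    for (int s : samples) putCode(s == 0 ? 0 : (s == 1 ? 1 : 2));
    putCode(3);                              // end-of-data marker
    return bytes;                            // zero padding is already in place
}

Encoding 1, 1, 0, -1, 0 with this sketch yields the two bytes 10100110 01110000 from the example above.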
If -1, 0, and 1 are all equally likely, then the formula for the number of bits required for n samples is ceiling(n log2(3)). For one sample, you get two bits as you have noted, effectively wasting one of the states, a little more than 0.4 bits per sample wasted.
As it turns out, five samples fit really nicely into eight bits, since 3^5 = 243, with only about 0.015 bits per symbol wasted.
You can use the extra states as end-of-stream symbols. For example, you could use five of the remaining 13 states to signal end-of-stream, indicating that there are 0, 1, 2, 3, or 4 samples remaining. Then if it's 1, 2, 3, or 4, there is one more byte with those samples. A little better would be to use three states for the 1 case, providing the sample in that byte. Then seven of the 13 states are used, requiring one byte to end the stream for the 0 and 1 cases, and two bytes to end the stream for the cases of 2, 3, or 4 remaining.
If -1, 0, and 1 have noticeably different probabilities, then you can use Huffman coding on the samples to represent the result in fewer bits than the "flat" case above. However there is only one Huffman code for one sample of three symbols, which would not give good performance in general. So you would again want to combine samples for better Huffman coding performance. (Or use arithmetic coding, but that is more involved than perhaps necessary in this case.) So you could again group five samples into one integer in the range 0..242, and Huffman code those, along with an end-of-stream symbol (call it 243) that occurs only once.
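For illustration, a sketch of packing five ternary samples into one byte as described (the helper names are hypothetical):

#include <cstdint>

// Pack five samples, each in {-1, 0, 1}, into one byte. The result is in
// 0..242 (3^5 = 243), leaving 243..255 free for end-of-stream markers.
uint8_t pack5(const int samples[5]) {
    int v = 0;
    for (int i = 0; i < 5; ++i)
        v = v * 3 + (samples[i] + 1);   // map -1,0,1 to 0,1,2
    return static_cast<uint8_t>(v);
}

// The inverse, recovering the five samples from one byte.
void unpack5(uint8_t b, int samples[5]) {
    int v = b;
    for (int i = 4; i >= 0; --i) {
        samples[i] = (v % 3) - 1;
        v /= 3;
    }
}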

Advice needed for an API for reading bits

I found a wonderful project called python-bitstring, and I believe a C++ port could be very helpful in quite a few situations (certainly in some projects of mine).
While porting the read/write/patch bytes methods, I didn't get any problems at all; it was as easy as translating Python to C++.
Anyway, now I'm getting to the bits methods and I'm not really sure how to express that functionality.
For example, let's say I want to create a method like:
readBits(uint64_t n_bits_to_read, uint64_t n_bits_to_skip = 0) {...}
Let's suppose, for the sake of this example, that this->data is a chunk of memory (void *) holding the entire data from which I'm reading.
So, the method will receive a number of bits to read and an optional number of bits to skip.
this->readBits(5, 2);
That way I'll be reading bits from position 2 to position 6 inclusive (forget little/big endian for the sake of this example).
0 1 1 0 1 0 1 1
    ‾ ‾ ‾ ‾ ‾
I can't return anything smaller than a byte (or can I?), so even if I actually read 5 bits, I'll still be returning 8. But what if I read 14 bits and skip 1? Is there a way I could return only those bits in some more useful form?
I'm thinking about a few common situations, for example:
Do the first 14 bits match "010101....."
Do the next 13 bits after skipping 2 match "00011010....."
Read the first 5 bits and convert them to an int/float
Read 7 bits after skipping 5 and convert them to an int/float
My question is: what type of data/structure/methods should I return/expose in order to make working with bits easier (or at least easier for the previously described situations).

Space-efficient way to store a multiset/unordered list

I need to store a large number of integers in a file. The order of the integers does not matter, so the total information content should be lower than that of an ordered list. Is there a more space-efficient way to store the numbers than as an arbitrarily ordered array?
Edit: I assume the integers to be completely random. I am really looking for a universal way to squeeze out the redundant information which is introduced by fixing a permutation.
To compress, you need redundancy; you cannot in general compress random data. Luckily for you, your problem specification allows you to reorder the data. Therefore, you may sort the data, introducing structure (redundancy) you can exploit. Then, instead of storing the list of integers, you need only store the smallest one and the sequence of first differences. The first differences will be smaller than the numbers themselves, so they should fit into fewer bits.
The sorted randomly generated sequence
sorted seq (173 218 257 490 618 638 715 815 856 929 932 996)
number of bits (  8   8   9   9  10  10  10  10  10  10  10  10)
can be stored as
first diff (173 45 39 233 128 20 77 100 41 73 3 64)
number of bits (  8   6   6   8   8   5   7   7   6   7   2   7)
Where e.g. 45 is the difference between 173 and 218, the first and second elements. These numbers require 77 bits versus 114 above. If the numbers are fairly dense in the range from which they were drawn, you may find that the maximum bit width needed for a first difference is lower than that for the data itself, enabling you to use a smaller fixed bit length. If you do not use a fixed size, you must also store delimiters or use some other adaptive scheme so you can determine where one number leaves off and the next begins. If your data has a large number of duplicates, as would occur if your numbers are drawn randomly from a relatively small range, you might also look into run-length encoding the zeros in the first differences.
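A minimal sketch of this sort-and-delta idea in C++, using a LEB128-style varint as the adaptive scheme mentioned above (the function name is invented):

#include <algorithm>
#include <cstdint>
#include <vector>

// Sort, then emit each first difference as a varint: 7 payload bits per
// byte, high bit set when more bytes follow. The first "difference" is the
// smallest element itself (its delta from 0).
std::vector<uint8_t> compressMultiset(std::vector<uint32_t> v) {
    std::sort(v.begin(), v.end());
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t x : v) {
        uint32_t delta = x - prev;   // non-negative because v is sorted
        prev = x;
        while (delta >= 0x80) {
            out.push_back(uint8_t(delta) | 0x80);
            delta >>= 7;
        }
        out.push_back(uint8_t(delta));
    }
    return out;
}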
In general I would say no. If your numbers have some pattern or are distributed in some particular way, then you should mention it.
This paper does exactly this.
Compressing Multisets with Large Alphabets
Paper: https://arxiv.org/abs/2107.09202
Code: https://github.com/facebookresearch/multiset-compression
Summary: https://twitter.com/_dsevero/status/1419661190750425102

seeking a better way to code and compress numbers

I have 13 numbers drawn from a set containing 13 types of item, 4 of each type, so 52 items in total. We can number the items 1, 2, 3, ..., 13, so the set holds four "1"s, four "2"s, ..., four "13"s. The 13 numbers drawn from the set are random, and the whole process is repeated millions of times or more, so I need an efficient way to store each draw.

I was thinking of using some coding method to compress the 13 integers into bits. For example, I could count how many "1"s, "2"s, ... were drawn, code the count for each type in 2 bits, and use 1 more bit to denote whether the type was drawn at all. So each type needs 3 bits, and 13 types cost 39 bits, which I have been rounding up to 8 bytes. But that is still too much: with millions or billions of draws, each stored to a file later, 8 bytes per draw would take about 80 GB, and cutting that in half would save 40 GB. Any idea how to compress this structure more efficiently? I also thought of using 5 bytes instead, but then I would need to take care of two different types of number (one int + one char). Is there any library in C++ that can easily do the coding/compressing for me?
Thanks.
Google's Protocol Buffers can store integers with fewer bits, depending on their values. It might reduce your storage significantly. See http://code.google.com/p/protobuf/
The actual protocol is described here: https://developers.google.com/protocol-buffers/docs/encoding
As for compression, have you looked at how zlib handles your data?
With your scheme, every hand of 39 bits represented by 8 bytes of 64 bits will have 25 bits wasted, about 40%.
If you batch hands together, you can represent them without wasting those bits.
39 and 64 have no common factors, so the lowest common multiple is just the multiple 39 * 64 = 2496 bits, or 312 bytes. This holds 64 hands and is about 60% of the size of your current scheme.
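A sketch of that batching in C++ (the struct and its layout are illustrative; it assumes each hand already fits in the low 39 bits of a uint64_t):

#include <cstdint>
#include <vector>

// Append 39-bit hands back to back so no bits are wasted between them;
// 64 hands land in exactly 312 bytes.
struct HandPacker {
    std::vector<uint8_t> bytes;
    unsigned used = 0;   // bits used in the last byte

    void append(uint64_t hand39) {
        for (int i = 38; i >= 0; --i) {      // MSB-first
            if (used == 0) bytes.push_back(0);
            bytes.back() |= uint8_t(((hand39 >> i) & 1) << (7 - used));
            used = (used + 1) & 7;
        }
    }
};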
Try googling LZ77 and LZW compression.
Maybe a bit more sophisticated than you're looking for, but check out HDF5.

How can I do arithmetic on numbers with a non-standard binary representation?

With an unsigned char you can store a number from 0 to 255:
255(b10) = 11111111(b2) <= that's 1 byte
This makes it easy to perform operations like +, -, *, ...
Now how about:
255(b10) = 10101101(b2)
Following this method, would it be possible to represent up to 399 using an unsigned char?
399(b10) = 11111111(b2)
Can someone propose an algorithm to perform addition using this method?
With eight bits there are only 256 possible values (2^8), no matter how you slice and dice it.
Your scheme to encode digits in a 2-3-3 form like:
255 = 10 101 101
399 = 11 111 111
ignores the fact that those three-bit sequences can only represent eight values (0-7), not ten (i.e., that second one would be 377, not 399).
The trade-off is that this means you gain the numbers '25[6-7]' (2 values) '2[6-7][0-7]' (16 values) and '3[0-7][0-7]' (64 values) for a total of 82 values.
Your sacrifice for that gain is that you can no longer represent any numbers containing 8 or 9: '[8-9]' (2 values), '[1-7][8-9]' (14 values), '[8-9][0-9]' (20 values), '1[0-7][8-9]' (16 values), '1[8-9][0-9]' (20 values) or '2[0-4][8-9]' (10 values), for a total of 82 values.
The balance there (82 vs. 82) shows that there are still only 256 possible values for an eight-bit data type.
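That accounting is easy to verify with a short brute-force check (a throwaway snippet, not part of the original answer):

#include <iostream>

// Count how many integers 0..399 survive the 2-3-3 digit encoding: the
// first decimal digit must fit in 2 bits (0-3), the others in 3 bits (0-7).
int main() {
    int count = 0;
    for (int n = 0; n <= 399; ++n) {
        int d1 = (n / 10) % 10, d0 = n % 10;
        if (d1 <= 7 && d0 <= 7) ++count;   // n <= 399 already bounds the top digit
    }
    std::cout << count << "\n";            // prints 256
    return 0;
}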
So your encoding scheme is based on a flawed premise, which makes the second part of your question (how to add them) irrelevant, I'm afraid.
An unsigned char can only hold values between 0 and 255, as determined by the rule 2^n - 1 for the maximum unsigned value that n bits can represent. There is no way to "improve" a char's range; you probably want an unsigned short, which holds two bytes, instead.
You're mistaken.
In your scheme, 255 would be 010101101, which is 9 bits. The leading zero is important. I'm assuming here you're using something that looks like the octal representation. 3 bits/digit. Any other alternative means you cannot represent all the other digits.
|0|000|
|1|001|
|2|010|
|3|011|
|4|100|
|5|101|
|6|110|
|7|111|
|8|???|
|9|???|
9 in binary is 1001.
So you can't use 3 bits per digit. You need to use 4 bits if you want to represent 8 and 9. Again, I'm trying to assume here that you're encoding each digit separately.
So, 399 according to you would be: 001110011001 - 12 bits.
By comparison, binary does 399 in 110001111 - 9 bits.
So binary is the most efficient, because encoding digits from 0 to 9 in your system means that the maximum number you can store without any information loss in 8 bits is 99 - 10011001 :)
One way to think of binary is as the path traced by a logarithmic (binary) search converging on the number.
If you really want to condense the number of bits needed to represent a number, what you're really after is some sort of compression and not the way binary is done.
What you want to do is mathematically impossible. You can only represent 256 discrete values with 8 boolean values.
To test this, make a chart of all possible values, in decimal and binary. I.e.
000 = 00000000
001 = 00000001
002 = 00000010
003 = 00000011
004 = 00000100
...
254 = 11111110
255 = 11111111
You will see that after 255, you need a ninth bit.
You can let 255 = 10101101, but if you work backwards from that, you will run out of bit patterns before you reach 0.
You seem to hope you can somehow use a different counting mechanism to store more values. This is not mathematically possible. See the Pigeonhole Principle.