Using bit compression

Lately I've been thinking about compression on computers, and stumbled upon the question: why isn't bitwise compression more common for large files?
I tried looking around and didn't find anyone discussing the subject, at least in the way I mean it. I might not be talking about the same thing, or not using the correct name, so I'll explain what I had in mind.
Let's say we have the following string: "Hi I'm a string!".
Its value in binary is:
01001000011010010010000001001001001001110110110100100000011000010010000001110011011101000111001001101001011011100110011100100001
As you can see, the binary sequence contains more than a few recurring runs of 0's and 1's. My idea is to remove them and include an indexing file saying exactly where you need to add 0's or 1's and how many. For example, let's break it into the first three bytes:
01001000 01101001 00100000
The indexing file will look like this:
[2,1] [5,3]
[1,1] [5,1]
[0,1] [3,4]
And the binary will be:
01010 010101 010
And of course there will be filler bits until the length reaches N % 8 == 0.
My question is: why isn't this type of compression common (or existent)? If it is, I would love to see an example of it being used practically in the real world; if it isn't, I would love to learn why.

This algorithm would work on certain types of data. It is far less effective than the algorithms that are in common use, though.
For example, the LZ family of algorithms can reference any data that has been seen before. It can reference runs of zeroes like your algorithm, but it can also reference any other pattern, so it is more general.
I don't think your algorithm would achieve compression on typical English text. There are too many 1 bits, and storing a bit position takes many bits.
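To make that cost concrete, here is a minimal sketch. It is not the asker's exact indexing format, just the closely related run-length idea, with an assumed 3-bit run count; tallying the output bits shows why the overhead dominates on ordinary text:

#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Collapse a bit string into (run length, bit) pairs.
std::vector<std::pair<int, char>> runs(const std::string& bits) {
    std::vector<std::pair<int, char>> out;
    for (char b : bits) {
        if (!out.empty() && out.back().second == b)
            ++out.back().first;
        else
            out.push_back({1, b});
    }
    return out;
}

int main() {
    // The first three bytes of the example string ("Hi ").
    std::string bits = "010010000110100100100000";
    auto r = runs(bits);
    // Suppose each run costs a 3-bit count plus 1 bit for the value.
    // (Runs longer than 7 would have to be split; none occur here.)
    int cost = 4 * int(r.size());
    std::cout << r.size() << " runs, ~" << cost << " bits encoded vs "
              << bits.size() << " bits raw\n"; // 13 runs, ~52 vs 24: a net loss
}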

2's complement binary expected output

I'm studying for my exam, and I would like to check my answers to this question:
Suppose binary values are signed 8-bit values, representing twos-complement format with a decimal range from -128 to 127. Which of the following statements are true/false?
1) 11111111 > 01111111
I think this is false because the first digit represents the sign, so we're comparing a negative value to a positive value.
2) (11111111 + 11111111) > (00000001 - 00000010)
I'm not so sure about this one because I don't know what happens when it overflows. I think the computer just drops the last digit. So I think the left-hand side is like -128 - 128 = -256. Then the right-hand side is 1 - 2 = -1, which is represented as 1000001. This means the inequality, in decimal, becomes -256 > -1, which is false. But again, I am not so sure about this one.
3) (10000000 / 00000100) == 11100000
The first part is -0/4 and the second part is non-zero, so would it be false?
Also, there are only sample problems, and I would like to practice/explore on my own. Is there any way in which I can write a C++ program to see the expected output of questions of this form?
Thank you.
Regarding playing around with this in C++, there are a bunch of bit manipulation libraries out there (just searching for "C++ bit manipulation library" will turn up lots of results). As a few examples:
https://github.com/Chris--A/BitBool
https://www.slac.stanford.edu/comp/unix/gnu-info/libg++_23.html
You may or may not find one that has a built-in notion of two's complement. Regardless, it could be a useful exercise for yourself to implement that kind of functionality on top of what one of these libraries provides, for example by subclassing something they provide or writing your own class encapsulating an existing library implementation as a data member.
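For checking answers like the ones above, you don't strictly need a library; a small sketch using int8_t (which behaves as two's complement on essentially every platform, and is guaranteed to since C++20) already settles all three statements:

#include <cstdint>
#include <iostream>

int main() {
    // Reinterpret the 8-bit patterns as signed two's-complement values.
    int8_t a = static_cast<int8_t>(0b11111111); // -1
    int8_t b = static_cast<int8_t>(0b01111111); // 127
    std::cout << "(1) " << (a > b) << '\n'; // 0: -1 > 127 is false

    // 8-bit arithmetic wraps modulo 256; -1 + -1 gives -2, not -256.
    int8_t lhs = static_cast<int8_t>(a + a);                 // -2
    int8_t rhs = static_cast<int8_t>(int8_t{1} - int8_t{2}); // -1
    std::cout << "(2) " << (lhs > rhs) << '\n'; // 0: -2 > -1 is false

    // 10000000 is -128 (not "-0"), and -128 / 4 = -32 = 11100000.
    int8_t q = static_cast<int8_t>(static_cast<int8_t>(0b10000000) / 4);
    std::cout << "(3) " << (q == static_cast<int8_t>(0b11100000)) << '\n'; // 1: true
}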

Reversible addition of 2 numbers

I have come across a problem I would like to solve where I want to add two numbers together (the real problem involves 24-bit numbers, but for ease of demonstration 8-bit is fine). Simple enough; they don't even need to be strictly added, an XOR would be fine. However, I would like this to be reversible. I know you can use an XOR cipher to combine a known number with the original and be able to decode it again, but I would like to do something similar with 2 starting numbers rather than a single one.
Truncating to 8-bit numbers for ease of demonstration:
I would like to have 10010100 added to 10110110 for instance. I would then like to be able to extract 10010100 from the resulting number to get 10110110 again.
I would add n 8-bit numbers, transmit the resulting number, and then decode it back into the n 8-bit numbers plus the original number. The transmitted number can be as large as it needs to be, although the smaller the better.
I am flexible on how to achieve this, either in hardware or in software.
Is this possible? Hopefully I have included sufficient information.
Thanks in advance
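For reference, a quick sketch of the XOR property the question leans on: XOR is its own inverse, so given the combined value and either original operand you can recover the other. Recovering both operands from an 8-bit combined value alone is impossible (16 bits of input cannot fit in 8), so the transmitted value has to grow; simple concatenation is the trivial widening scheme shown here:

#include <cstdint>
#include <iostream>

int main() {
    uint8_t a = 0b10010100;
    uint8_t b = 0b10110110;

    // Combine: reversible given either operand, since (a ^ b) ^ a == b.
    uint8_t combined = a ^ b;
    std::cout << ((combined ^ a) == b) << '\n'; // 1

    // To recover *both* numbers with no side knowledge, keep all 16 bits.
    uint16_t packed = (uint16_t(a) << 8) | b;
    uint8_t a2 = uint8_t(packed >> 8);
    uint8_t b2 = uint8_t(packed & 0xFF);
    std::cout << ((a2 == a) && (b2 == b)) << '\n'; // 1
}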

Regular expressions for bit patterns or binary data?

A question out of curiosity: is there a way to do pattern matching at the bit level?
Currently, all the regex systems I have seen operate on a byte- or character-based representation, but I haven't seen any that will let you match at the bit level.
For example, if I have a bit field like this:
011101100011100110110001
(24 bits!), can I check that bits 7, 8 and 9 match the pattern 100?
Language-agnostic answers are preferable, but as I know of nothing that does this, I would appreciate any insight.
NOTE: I wish to do this on an arbitrary number of bits, so converting to bytes (or padding to a byte size) and applying a convoluted normal regexp is NOT what I want!
Thanks,
Certainly there is no theoretical limit that would make it impossible. In fact, the underlying theory applies to any alphabet, and examples often use quite small alphabets, though not usually the one consisting of just the symbols 0 and 1. You might want to read a book on computational theory.
Assuming that you're trying to check actual bits and not a string of 1s and 0s, I don't believe you can do this with regex per se, but you can apply a bit mask to check the status of certain bits. For example, to check that the leftmost (most significant) bit is 1:
11000100
AND
10000000
= 10000000
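The same mask-and-shift approach extends to the 24-bit example. A minimal sketch, assuming bits are numbered 1 to 24 starting from the most significant end (the numbering the question appears to use):

#include <cstdint>
#include <iostream>

int main() {
    // The 24-bit field from the question.
    uint32_t field = 0b011101100011100110110001;

    // Bit k (1-indexed from the MSB of the 24 bits) sits (24 - k) bits
    // above the LSB, so the group at bits 7..9 starts 15 bits up.
    uint32_t group = (field >> 15) & 0b111;
    std::cout << (group == 0b100) << '\n'; // 1: bits 7, 8, 9 are 1, 0, 0
}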

Shannon's entropy formula. Help my confusion

My understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but that is the understanding I have relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1's followed by 100 '0's = 200 bits. The alphabet is {0,1} and the base of the entropy is 2. The probability of symbol "0" is 0.5, and of "1" is 0.5, so the entropy is 1, i.e. 1 bit to represent each bit.
However, you can run-length encode it with something like 100 / 1 / 100 / 0, i.e. the number of bits to output followed by the bit. It seems like I have a representation smaller than the data, especially if you increase the 100 to a much larger number.
I'm using http://en.wikipedia.org/wiki/Information_entropy as a reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers, my follow-up questions are: would you apply the entropy formula to a particular instance of a message to try to find its information content? Would it be valid to take the message "aaab" and say its entropy is ~0.811? If yes, then what is the entropy of 1...10...0 (where the 1s and 0s are each repeated n times) according to the formula? Is the answer 1?
Yes, I understand that you are creating a random variable from your input symbols and estimating the probability mass function from your message. What I'm trying to confirm is that the entropy formula does not take into account the position of the symbols in the message.
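As a sanity check on those numbers, here is a small sketch that computes the zeroth-order entropy of a message from its symbol frequencies. It prints roughly 0.811 for "aaab" and exactly 1 for n 1's followed by n 0's, confirming that symbol positions play no role in the formula:

#include <cmath>
#include <iostream>
#include <map>
#include <string>

// Per-symbol Shannon entropy, with probabilities estimated from the
// symbol frequencies in the message itself.
double entropy(const std::string& msg) {
    std::map<char, int> counts;
    for (char c : msg) ++counts[c];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = double(kv.second) / double(msg.size());
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    std::cout << entropy("aaab") << '\n';                                        // ~0.811
    std::cout << entropy(std::string(100, '1') + std::string(100, '0')) << '\n'; // 1
}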
Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake is. If you're able to compress something into a form smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (Strictly speaking, the per-symbol information log(1/p) is undefined for the symbol whose probability is zero, but with the usual convention that 0 log 0 = 0 the entropy of such a source works out to exactly 0.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price: you have to devise a scheme that knows how to compress and decompress this data, and that scheme itself takes some information.
However, you can run-length encode it with something like 100 / 1 / 100 / 0, i.e. the number of bits to output followed by the bit. It seems like I have a representation smaller than the data, especially if you increase the 100 to a much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original tells you something important: namely, that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.
Have a look at Kolmogorov complexity:
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to the alphabet {0,1}. For your example, use {0...0, 1...1} (a hundred 0's and a hundred 1's).
Your encoding works in this example, but it is possible to conceive of an equally valid case: 010101010101..., which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!
John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, of which the -log (base 2) is 200 bits, as you expect. However, that string's contribution to the entropy (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you use run-length coding to compress, this particular string will get a short encoding, but averaged over all 2^200 strings it will not do well. With luck, it will average out to about 200 bits, but not less.
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more like it, then you are really saying its probability is larger than 0.5^200. In other words, you are making a different assumption about the probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: after converting the data to an array of XOR deltas, zlib shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous one, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0011
0000
1000
A bit of pseudocode:
compressed[0] = uncompressed[0]
for i = 1 to n-1:
    compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
for i = 1 to n-1:
    uncompressed[i] = uncompressed[i-1] ^ compressed[i]
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
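A self-contained C++ version of the transform above, as a sketch assuming 32-bit unsigned integers:

#include <cassert>
#include <cstdint>
#include <vector>

// Forward transform: keep the first value, then store each value XORed
// with its predecessor. Mostly-unchanged inputs become mostly zeroes.
std::vector<uint32_t> xor_deltas(const std::vector<uint32_t>& in) {
    std::vector<uint32_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = (i == 0) ? in[0] : (in[i - 1] ^ in[i]);
    return out;
}

// Inverse transform: a running XOR reconstructs the original values.
std::vector<uint32_t> undo_xor_deltas(const std::vector<uint32_t>& in) {
    std::vector<uint32_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = (i == 0) ? in[0] : (out[i - 1] ^ in[i]);
    return out;
}

int main() {
    std::vector<uint32_t> v = {0b1101, 0b1101, 0b1110, 0b1110, 0b0110};
    assert(undo_xor_deltas(xor_deltas(v)) == v); // round-trips exactly
}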
Have you considered Run-length encoding?
Or try this: instead of storing the numbers themselves, store the differences between them. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you would otherwise encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high bit of the 8-bit integer to say "this number requires the next 8 bits as well" (see the sketch below).
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and both run very fast and with very little memory (as opposed to, say, bzip).
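A minimal sketch of that variable-length byte idea, which is essentially LEB128-style varint coding. The function names are mine, and the zig-zag step is an added assumption so that small negative gaps also fit in one byte:

#include <cstdint>
#include <vector>

// Encode an unsigned value 7 bits at a time, low bits first; the high
// bit of each byte flags that another byte follows.
void put_varint(std::vector<uint8_t>& out, uint32_t v) {
    while (v >= 0x80) {
        out.push_back(uint8_t(v) | 0x80);
        v >>= 7;
    }
    out.push_back(uint8_t(v));
}

// Map signed gaps onto unsigned codes (0, -1, 1, -2, 2, ... become
// 0, 1, 2, 3, 4, ...) so gaps in either direction stay small.
uint32_t zigzag(int32_t d) {
    return (uint32_t(d) << 1) ^ (d < 0 ? 0xFFFFFFFFu : 0u);
}

// Delta-encode a sequence: each value is stored as its gap from the
// previous one (the first value's "gap" is from an implicit 0).
std::vector<uint8_t> delta_encode(const std::vector<int32_t>& in) {
    std::vector<uint8_t> out;
    int32_t prev = 0;
    for (int32_t x : in) {
        put_varint(out, zigzag(x - prev));
        prev = x;
    }
    return out;
}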
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break each of your ints up into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like run-length encoding or zlib. This is my favourite of the methods I present (see the sketch after these ideas).
If the integers in each array are closely related to the ones before them, you could store the original integer followed by diffs against the previous entry. This should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you may still have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry use 0 or more bytes to indicate which bits should be flipped, storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0-32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
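A short sketch of that first (byte-plane) idea, assuming 32-bit integers; to_byte_planes is a made-up name:

#include <cstdint>
#include <vector>

// Rearrange an int array into four byte "planes": plane 0 holds every
// integer's lowest byte, plane 1 every second byte, and so on. When
// consecutive integers differ by only a bit or two, most planes turn
// into long runs of identical bytes, which RLE or zlib handle well.
std::vector<uint8_t> to_byte_planes(const std::vector<uint32_t>& in) {
    std::vector<uint8_t> out;
    out.reserve(in.size() * 4);
    for (int plane = 0; plane < 4; ++plane)
        for (uint32_t v : in)
            out.push_back(uint8_t(v >> (8 * plane)));
    return out;
}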
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences. Perhaps use an 8-bit byte something like this as a starting point:
1 bit to indicate whether a complete new integer follows, or whether this byte encodes a difference from the last integer;
1 bit to indicate that there are more bytes following, recording more single-bit differences for the same integer;
6 bits to record the number of the bit to flip relative to your previous integer.
If there are more than 4 bits different, then store the integer outright.
This scheme might not be appropriate if you also have a lot of completely different integers, since they'll take 5 bytes each now instead of 4.
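A hypothetical encoder for that layout, with bit 7 as the "literal integer follows" flag, bit 6 as the continuation flag, and the low 6 bits as the flipped bit's index; those placements are assumptions, since only the field widths are fixed above:

#include <cstdint>
#include <vector>

// Append one integer, either as a literal (flag byte + 4 raw bytes) or
// as a chain of single-bit-flip bytes relative to the previous integer.
void encode_next(std::vector<uint8_t>& out, uint32_t prev, uint32_t cur) {
    std::vector<uint8_t> flips;
    for (uint8_t pos = 0; pos < 32; ++pos)
        if ((prev ^ cur) & (1u << pos)) flips.push_back(pos);

    if (flips.empty() || flips.size() > 4) {
        // Identical integers get no cheap code in this sketch; a real
        // scheme would want a dedicated "no change" byte for them.
        out.push_back(0x80); // bit 7 set: a full integer follows
        for (int i = 0; i < 4; ++i) out.push_back(uint8_t(cur >> (8 * i)));
    } else {
        for (size_t i = 0; i < flips.size(); ++i) {
            uint8_t more = (i + 1 < flips.size()) ? 0x40 : 0x00; // bit 6
            out.push_back(more | flips[i]); // low 6 bits: bit index
        }
    }
}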
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 = 17 bits), which will tend towards 17/32 of the original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent-sized array, that first integer would be negligible.
Best case is a file of identical integers (no bit changes for any integer, just the 2 zero bits); this will tend towards 2/32 of the original size (93.75% compression).
Where you average 2 bit changes per consecutive integer (as you say is your common case), you'll get 2+5+5 = 12 bits per integer, which will tend towards 12/32 of the original size (62.5% compression).
Your break-even point (if zlib gives 75% compression) is 8 bits per integer, which you would reach with:
single-bit changes (2+5 = 7 bits) for 80% of the transitions, and
double-bit changes (2+5+5 = 12 bits) for 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7-Zip: it has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff, anyway) that it performs much better than WinZip on a Windows platform, so it may also outperform zlib.