I know that for small files the compressed output can sometimes be larger than the original file.
Are the minimum file sizes known for popular compression libraries such as gzip and lz4?
I am dealing with files that are ~ 384 bytes.
This can be easily determined through experimentation. The smallest file gzip can shrink is 24 bytes of zeros, which compress to 23 bytes. The smallest lz4 can do is 31 bytes of zeros, down to 30.
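If you want to reproduce the experiment, here is a minimal sketch using Python's gzip module and the third-party lz4 package (an assumption; the figures above came from the command-line tools, so the exact thresholds may differ slightly with container headers and settings):

import gzip
import lz4.frame  # third-party 'lz4' package, not in the standard library

def smallest_compressible(compress, limit=128):
    # Return the first n for which n zero bytes compress to fewer than n bytes.
    for n in range(1, limit):
        if len(compress(b"\x00" * n)) < n:
            return n
    return None

print("gzip:", smallest_compressible(gzip.compress))
print("lz4 :", smallest_compressible(lz4.frame.compress))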
I am studying the inner-workings of Gzip, and I understand that it uses a combination of Huffman Coding and LZ77.
I also realize that a Gzip file is divided into blocks, and each block has a dictionary built for it. Then frequent occurrences of similar data are replaced by pointers pointing at locations in the dictionary.
So the phrase "horses race other horses" would have the word horses replaced by a pointer.
However, what if I have an array of 32-bit integers that only stores numbers up to 24 bits? For argument's sake, let's say these 24-bit numbers are very random, hard to compress, and hard to find repetition in.
That would make the first 8 bits of every integer an easy-to-compress string of 0s, but each string would need a pointer, and each pointer still takes up some amount of data. Even a 1-bit pointer (which I know is smaller than what's realistically possible) would still take up 12.5% of the original space.
That would seem somewhat redundant when the array could easily be reduced to a "24-bit" array with basic pattern recognition.
So my question is:
Does Gzip contain any mechanisms to better compress a file than dictionary pointers?
How well can Gzip compress small amounts of repetitive data, followed by small amounts of hard to compress data?
Each deflate block does not have a "dictionary built for it". What is built for each deflate block is a set of Huffman codes for the literal/length symbols and the distance symbols.
The dictionary you refer to is simply the 32K bytes of uncompressed input that immediately precede the bytes currently being compressed. That's it. Each length/distance pair can refer to a string of 3 to 258 bytes in the last 32K. That is independent of deflate blocks, and such references often go back one or more blocks.
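To see that 32K window in action, here is a small sketch using Python's zlib (which produces raw deflate with the default 32K window): a repeated 1000-byte string is matched, and therefore compressed, only when fewer than 32K bytes separate the two copies.

import os
import zlib

chunk = os.urandom(1000)  # an incompressible 1000-byte string

for gap in (20000, 40000):
    # The chunk appears twice, separated by `gap` random bytes.
    data = chunk + os.urandom(gap) + chunk
    out = len(zlib.compress(data, 9))
    # gap 20000: the second copy is within the window, so the match saves ~1000 bytes
    # gap 40000: the distance exceeds 32K, no match is possible, output slightly > input
    print(f"gap {gap}: input {len(data)}, output {out}")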
Deflate will not do well trying to compress a sequence of three random bytes, zero byte, three random bytes, zero byte, and so on. There will be no useful repeated strings, so deflate will only be able to Huffman code the literals, with zeros being more frequent. It would code zeros as two bits, since they occur a little more than 25% of the time, and the rest of the literals at least 8.25 bits each. For this data that would give an average of about 6.7 bits per byte, or a compression ratio of 0.85. In fact gzip gives about 0.86 on this data.
If you want to compress that sequence, simply remove the zero bytes! Then you are done, with no further compression possible at a ratio of 0.75.
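Those numbers are easy to check with Python's zlib (a sketch; the exact ratio wobbles a little from run to run):

import os
import zlib

# Build the pattern: three random bytes, then a zero byte, repeated.
data = b"".join(os.urandom(3) + b"\x00" for _ in range(100000))

ratio = len(zlib.compress(data, 9)) / len(data)
print("deflate ratio:", ratio)  # roughly 0.85-0.86, as described above

# The better "compressor": just drop the zero byte from every group of four.
stripped = bytes(b for i, b in enumerate(data) if i % 4 != 3)
print("strip ratio  :", len(stripped) / len(data))  # exactly 0.75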
I know the CRC calculation algorithm from Wikipedia, and I read about the structure of a RAR file here. For example, it says:
The file has the magic number of:
0x 52 61 72 21 1A 07 00
This breaks down as follows to describe an archive header:
0x6152 - HEAD_CRC
0x72 - HEAD_TYPE
0x1A21 - HEAD_FLAGS
0x0007 - HEAD_SIZE
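That layout is easy to verify: reading the seven signature bytes as little-endian fields reproduces exactly the values above. A quick sketch with Python's struct module:

import struct

marker = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00])  # "Rar!\x1a\x07\x00"

# Little-endian: u16 HEAD_CRC, u8 HEAD_TYPE, u16 HEAD_FLAGS, u16 HEAD_SIZE
head_crc, head_type, head_flags, head_size = struct.unpack("<HBHH", marker)
print(hex(head_crc), hex(head_type), hex(head_flags), hex(head_size))
# -> 0x6152 0x72 0x1a21 0x7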
If I understand correctly, HEAD_CRC (0x6152) is the CRC value of the marker block (MARK_HEAD). Somewhere I read that the CRC of a WinRAR file is calculated with the standard polynomial 0xEDB88320, but that when the size of the CRC is less than 4 bytes, the less significant bytes are used. In this case (if I understand correctly) the CRC value is 0x6152, so it is 2 bytes. Now I don't know which bytes to take as the less significant ones. From the standard polynomial (0xEDB88320)? Then 0x8320 would presumably be the less significant bytes of that polynomial. Next, how do I calculate the CRC of the marker block (i.e. of the bytes 0x 52 61 72 21 1A 07 00) once we have the right polynomial?
There was likely a 16-bit check for an older format that is not derived from a 32-bit CRC. The standard 32-bit CRC used by zip and rar, applied to the last five bytes of the header, has no portion equal to the first two bytes. The Polish page appears to be incorrect in claiming that the two-byte check is the low two bytes of a 32-bit CRC.
It does appear from the documentation that that header is constructed the same way as other blocks in the older format, and that the author, for fun, arranged for his format to give the check value "Ra" so that the file could spell out "Rar!" followed by a text-terminating control-Z.
I found another 16-bit check in the unrar source code, but that check does not result in those values either.
Oh, and no, you can't take part of a CRC polynomial and expect that to be a good CRC polynomial for a smaller check. What the page in Polish is saying is that you would compute the full 32-bit CRC and then take the low two bytes of the result. However, that doesn't work for the magic number header.
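That claim is easy to test with Python's zlib, whose crc32 implements the standard 0xEDB88320 polynomial (a sketch; per the above, the result does not reproduce the stored value):

import zlib

marker = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00])

# Standard CRC-32 over everything after the HEAD_CRC field, truncated to
# the low two bytes -- the rule the Polish page describes.
check = zlib.crc32(marker[2:]) & 0xFFFF
print(hex(check), "vs stored", hex(0x6152))  # these differ, as noted above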
Per the WinRAR TechNote.txt file included with the install:
The marker block is actually considered as a fixed byte sequence: 0x52 0x61 0x72 0x21 0x1a 0x07 0x00
And as you already indicated, at the very end you can read:
The CRC is calculated using the standard polynomial 0xEDB88320. In case the size of the CRC is less than 4 bytes, only the low order bytes are used.
In Python, the calculation and grabbing of the 2 low order bytes goes like this:
zlib.crc32(correct_byte_range) & 0xffff
rerar has some code that does this, just like the rarfile library that it uses. ReScene .NET source code has an algorithm in C# for calculating the CRC32 hash. See also How do I calculate CRC32 mathematically?
I have a program that creates a compressed file using the LZW algorithm, employing hash tables. My compressed file currently contains integers corresponding to the indexes of the hash table.
The maximum integer in this compressed file is around 46000, which can easily be represented by 16 bits.
Now when I convert this "compressedfile.txt" to a binary file "binary.bin" (to further reduce the file size) using the following code, I get 32-bit integers in my "binary.bin" file. E.g. the number 84 in my compressed file ends up as the bytes 54 00 00 00 in my binary file (84 = 0x54, written as a little-endian 32-bit int).
#include <fstream>

std::ifstream in("compressedfile.txt");
std::ofstream out("binary.bin", std::ios::out | std::ios::binary);
int d;
while (in >> d)
{ out.write((char*)&d, 4); }  // writes all 4 bytes of the int
My question is: can't I discard the trailing '0000' in '5400 0000', which uses up an extra 2 bytes in my file? This is the case for every integer, since my maximum is 46000, which can be represented using only 2 bytes. Is there any code that can set the width of the integers in my binary file that way? I hope my question is clear.
It's writing exactly what you tell it to: 4 bytes starting at the address of d (an int, which is 32 bits on many platforms). Use a 16-bit type and write 2 bytes instead:
uint16_t d; // unsigned to ensure it's large enough to hold your max value of 46000
while (in >> d) out.write(reinterpret_cast<char*>(&d), sizeof d);
Edit: As pointed out in the comments, for this code and the data it generates to be portable across processor architectures you should pick an endianness convention for the output. I'd suggest using htons() to convert your uint16_t to network byte order which is widely available, though not (yet) part of the C++ standard.
I have 13 numbers drawn from a set with 13 types of item, 4 items per type, so 52 items in total. Numbering the types 1 through 13, the set contains 4 "1"s, 4 "2"s, ..., 4 "13"s, and the 13 numbers drawn from it are random. The whole process is repeated millions of times or more, so I need an efficient way to store the 13 numbers.

I was thinking of using some coding method to compress the 13 integers into bits. For example, I could first count how many of each of "1", "2", ... were drawn, code the count for each type with 2 bits, and use 1 more bit to denote whether the type was drawn at all. So each type needs 3 bits, and the 13 types cost 39 bits in total, which still takes 8 bytes to store. That is still too much, since I am talking about millions or billions of draws and each set has to be stored to a file later: at 8 bytes each, that's about 80 GB of data. If I can cut that in half, I save 40 GB.

Any idea how to compress this structure more efficiently? I also thought of using 5 bytes instead, but then I'd need to take care of mixed types (one int + one char). Is there any library in C++ that can easily do the coding/compressing for me?
Thanks.
Google's Protocol Buffers can store integers with fewer bits, depending on their values. It might reduce your storage significantly. See http://code.google.com/p/protobuf/
The actual protocol is described here: https://developers.google.com/protocol-buffers/docs/encoding
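The saving comes from protobuf's base-128 varints: each output byte carries 7 bits of the value plus a continuation bit, so small values take fewer bytes. A minimal sketch of the encoding (an illustration, not the official library code):

def encode_varint(value):
    # Base-128 varint: 7 value bits per byte, high bit set on all but the last.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(encode_varint(1).hex())    # '01'   -> 1 byte
print(encode_varint(13).hex())   # '0d'   -> 1 byte
print(encode_varint(300).hex())  # 'ac02' -> 2 bytes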
As for compression, have you looked at how zlib handles your data?
With your scheme, every hand of 39 bits represented by 8 bytes of 64 bits will have 25 bits wasted, about 40%.
If you batch hands together, you can represent them without wasting those bits.
39 and 64 have no common factors, so the least common multiple is simply the product 39 * 64 = 2496 bits, or 312 bytes. This holds 64 hands and is about 60% of the size of your current scheme.
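A sketch of that batching in Python (hypothetical pack/unpack helpers; a C++ version would use the same shifts and masks): hands are laid down back to back at 39 bits each, with no per-hand padding.

HAND_BITS = 39

def pack_hands(hands):
    # Concatenate the 39-bit hand codes into one big integer, then emit bytes.
    acc = 0
    for i, h in enumerate(hands):
        acc |= h << (i * HAND_BITS)
    nbytes = (len(hands) * HAND_BITS + 7) // 8
    return acc.to_bytes(nbytes, "little")

def unpack_hands(blob, count):
    acc = int.from_bytes(blob, "little")
    mask = (1 << HAND_BITS) - 1
    return [(acc >> (i * HAND_BITS)) & mask for i in range(count)]

hands = list(range(64))                 # 64 dummy 39-bit hand codes
blob = pack_hands(hands)
print(len(blob), "bytes for 64 hands")  # 312, versus 512 at 8 bytes per hand
assert unpack_hands(blob, 64) == hands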
Try googling LZ77 and LZW compression.
Maybe a bit more sophisticated than you're looking for, but check out HDF5.
I'm looking for a compression algorithm that works with symbols smaller than a byte. I did some quick research on compression algorithms, and it has been hard to find out what symbol sizes they use. In any case, there are streams whose symbols are smaller than 8 bits. Is there a parameter for DEFLATE to define the size of its symbols?
plaintext symbols smaller than a byte
The original descriptions of LZ77 and LZ78 describe them in terms of a sequence of decimal digits (symbols that are approximately half the size of a byte).
If you google for "DNA compression algorithm", you can get a bunch of information on algorithms specialized for compression files that are almost entirely composed of the 4 letters A G C T, a dictionary of 4 symbols, each one about 1/4 as small as a byte.
Perhaps one of those algorithms might work for you with relatively little tweaking.
The LZ77-style compression used in LZMA may appear to use two bytes per symbol for the first few symbols that it compresses. But after compressing a few hundred plaintext symbols (the letters of natural-language text, or sequences of decimal digits, or sequences of the 4 letters that represent DNA bases, etc.), the two-byte compressed "chunks" that LZMA puts out often represent a dozen or more plaintext symbols. (I suspect the same is true for all similar algorithms, such as the LZ77 algorithm used in DEFLATE.)
If your files use only a restricted alphabet of much less than all 256 possible byte values, in principle a programmer could adapt a variant of DEFLATE (or some other algorithm) to use information about that alphabet and produce compressed files a few bits smaller than the same files compressed with standard DEFLATE. However, many byte-oriented text compression algorithms (LZ77, LZW, LZMA, DEFLATE, etc.) build a dictionary of common long strings, and may give compression performance (with a sufficiently large source file) within a few percent of that custom-adapted variant; often the advantages of using a standard compressed file format are worth sacrificing a few percent of potential space savings.
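As a concrete comparison, here is a sketch (Python, made-up data) that packs random DNA text at a fixed 2 bits per base and runs plain zlib on the same text; deflate's Huffman stage gets close to 2 bits per symbol without any custom adaptation, while the fixed packing hits it exactly:

import random
import zlib

random.seed(1)
text = "".join(random.choice("ACGT") for _ in range(100000)).encode()

# Fixed 2-bit code: exactly 1/4 of the original size (plus rounding).
code = {ord("A"): 0, ord("C"): 1, ord("G"): 2, ord("T"): 3}
packed = bytearray()
for i in range(0, len(text), 4):
    byte = 0
    for j, b in enumerate(text[i:i + 4]):
        byte |= code[b] << (2 * j)
    packed.append(byte)

print("packed :", len(packed) / len(text))                  # 0.25
print("deflate:", len(zlib.compress(text, 9)) / len(text))  # a bit above 0.25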
compressed symbols smaller than a byte
Many compression algorithms, including some that give the best known compression on benchmark files, output compressed information bit-by-bit (such as most of the PAQ series of compressors, and some kinds of arithmetic coders), while others output variable-length compressed information without regard for byte boundaries (such as Huffman compression).
Some ways of describing arithmetic coding talk about pieces of information, such as individual bits or pixels, that are compressed to "less than one bit of information".
EDIT:
The "counting argument" explains why it's not possible to compress all possible bytes, much less all possible bytes and a few common sequences of bytes, into codewords that are all less than 8 bits long.
Nevertheless, several compression algorithms can and often do represent some bytes or (more rarely) some sequences of bytes with a codeword that is less than 8 bits long, by "sacrificing" or "escaping" less-common bytes, which end up represented by other codewords that (including the "escape") are more than 8 bits long.
Such algorithms include:
The Pike Text compression using 4 bit coding
byte-oriented Huffman
several combination algorithms that do LZ77-like parsing of the file into "symbols", where each symbol represents a sequence of bytes, and then Huffman-compressing those symbols -- such as DEFLATE, LZX, LZH, LZHAM, etc.
The Pike algorithm uses the 4 bits "0101" to represent 'e' (or in some contexts 'E'), the 8 bits "0000 0001" to represent the word " the" (4 bytes, including the space before it) (or in some contexts " The" or " THE"), etc.
It has a small dictionary of about 200 of the most frequent English words, including a sub-dictionary of 16 extremely common English words.
When compressing English text with byte-oriented Huffman coding, the sequence "e " (e space) is compressed to two codewords with a total of typically 6 bits.
Alas, when Huffman coding is involved, I can't tell you the exact size of those "small" codewords, or even tell you exactly what byte or byte sequence a small codeword represents, because it is different for every file.
Often the same codeword represents a different byte (or different byte sequence) at different locations in the same file.
The decoder decides which byte or byte sequence a codeword represents based on clues left behind by the compressor in the headers, and on the data decompressed so far.
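To make that concrete, here is a sketch that builds a byte-oriented Huffman code from the byte frequencies of a sample text (using Python's heapq; as noted above, the code lengths come out different for every input):

import heapq
from collections import Counter

def huffman_code(data):
    # Return {byte: bitstring}. Each heap entry is (frequency, tiebreak, codes).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c1.items()}
        merged.update({s: "1" + bits for s, bits in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

sample = b"the quick brown fox jumps over the lazy dog " * 200
codes = huffman_code(sample)
print("codeword for 'e':", codes[ord("e")])  # one of the shorter codewords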
With range coding or arithmetic coding, the "codeword" may not even be an integer number of bits.
You may want to look into Golomb codes. A Golomb code divides each value by a tunable parameter, then encodes the quotient in unary and the remainder in binary. It's not dictionary compression, but it's worth mentioning.
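As a sketch of the idea, here is a Rice code, the power-of-two special case of a Golomb code, in Python (the parameter k is an assumption you would tune to your data's distribution):

def rice_encode(n, k):
    # Quotient in unary (q ones then a zero), then the k-bit remainder.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def rice_decode(bits, k):
    q = bits.index("0")                  # count the leading unary ones
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r

for n in (0, 1, 5, 13):
    print(n, "->", rice_encode(n, k=2))  # small values get short codewords
assert rice_decode(rice_encode(13, 2), 2) == 13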