Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?

Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.
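To make the diff idea a bit more concrete, here is a minimal sketch of a crude block-level delta. It is not one of the real binary diff tools, just an illustration: it assumes the shared content sits at block-aligned positions in both files (real tools also cope with shifted data), and the names make_delta, apply_delta and BLOCK are made up for this example.

```python
import hashlib

BLOCK = 4096  # hypothetical block size, chosen only for illustration


def make_delta(base: bytes, target: bytes, block: int = BLOCK) -> list:
    """Describe `target` as copies of blocks from `base` plus literal blocks."""
    # Index every block of the base file by its hash.
    index = {}
    for off in range(0, len(base), block):
        index.setdefault(hashlib.sha256(base[off:off + block]).digest(), off)

    delta = []
    for off in range(0, len(target), block):
        chunk = target[off:off + block]
        ref = index.get(hashlib.sha256(chunk).digest())
        if ref is not None:
            delta.append(("copy", ref, len(chunk)))  # block already present in base
        else:
            delta.append(("data", chunk))            # new content, stored as-is
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target from the base file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, off, length = op
            out += base[off:off + length]
        else:
            out += op[1]
    return bytes(out)
```

You would then store one file in full (compressed normally) and only the "data" blocks plus the copy list for each of the others.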

Related

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now I want to develop a game, where random access to individual files would be more efficient. If everything is archived first, access becomes sequential.
I know that there is an alternative format called zip or 7z, but what about other formats?
Besides tar.gz, I'd like to find a small compression library that also provides archiving features.
Should I use *.tar or other solutions are available?
PS: I'm using C++.

"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of contents of the .tar file, find and store where all the files are in the archive, and then extract them as you need. However, you may find that using your own format is "better". For example, if I remember correctly, a tar archive stores its "header" one file at a time, whereas you may want to build your header in one lump, before you store the files. That does mean at least enumerating all the relevant files first, then forming the compressed variants and "patching up" the header with the offsets of the compressed data.
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
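A rough sketch of such a do-it-yourself benchmark, using only the codecs in the Python standard library (zlib, bz2, lzma). LZ4 and LZO from the linked comparison are not in the standard library, so they are left out here; the function name benchmark and its result layout are assumptions made for this example.

```python
import bz2
import lzma
import time
import zlib

# Codec name -> (compress function, decompress function), standard library only.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes, repeats: int = 5) -> dict:
    """Return {codec: (compressed_size, best_decompression_seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        blob = compress(data)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            out = decompress(blob)
            best = min(best, time.perf_counter() - start)
        assert out == data  # sanity check: the round trip must be lossless
        results[name] = (len(blob), best)
    return results
```

Feed it a few representative asset files from your own game rather than a generic corpus; the ranking can change with the kind of data.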

Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, .tar.gz and .zip use the same compression algorithm (DEFLATE) by default, and the only reason that .tar.gz files are typically smaller is that they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, the Quake engines have relied on an off-the-shelf zip format since Quake III (.pk3, later .pk4), with a few minor additions (like a built-in checksum, I think); the older .pak files were a simpler, uncompressed archive.
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive you have to decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip, and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives, just start with libzip (not to be confused with zlib (for gzip or .gz)).
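To illustrate the "extract individual files at a time" point, here is a minimal sketch using Python's standard zipfile module (used here only for brevity; the question is about C++, where a library such as libzip plays the same role). The helper names build_archive and read_member are made up for this example.

```python
import io
import zipfile


def build_archive(files: dict) -> bytes:
    """Pack {name: bytes} into a zip archive, compressing each file on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Decompress a single member, located through the zip central directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)
```

Reading one texture or sound this way does not touch the compressed data of any other member, which is exactly what a .tar.gz cannot offer.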
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-ballz" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the "gzip" compression (from zlib) has just been historically an industry-standard compression algorithm, although, now, there are better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Mostly, compressed archive formats often reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, BZIP, etc.), and often, formats like zip / 7z / rar are only really container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z. And their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.

Can compression algorithm "learn" on set of files and compress them better?

Is there a compression library that supports "learning" on some set of files, or using some files as a base for compressing other files?
This can be useful if we want to compress many similar files retaining fast access to each of them.
Something like:
# compression:
compressor.learn_on_data(standard_data);
compressor.compress(data, data_compressed);
# decompression:
decompressor.learn_on_data(the_same_standard_data);
decompressor.decompress(data_compressed, data);
What is this called (I think that "delta compression" is something slightly different)? Are there implementations of this in popular compression libraries? I expect it to work by, for example, pre-filling dictionaries with standard data.

Yes it works.
Although there are many techniques for this, the easiest one you'll find is called "dictionary pre-filling". In short, you provide a file, the last part of which is "digested" (up to the maximum window size, which can be anywhere from 4K to 64MB depending on your algorithm) and can therefore be used to better compress the next file.
In practice, this is similar to "solid mode", when within an archive all files of identical type are grouped together, so that the previous file can be used as a dictionary for the next one, which improves compression ratio.
Downside: the same dictionary must be provided to both the compressor and the decompressor.
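As one concrete instance of dictionary pre-filling, zlib accepts a preset dictionary, and Python's zlib module exposes it through the zdict argument. A minimal sketch (the wrapper names compress_with_dict and decompress_with_dict are made up; note that zlib's window is only 32K, so only the tail of a large dictionary is of any use):

```python
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress `data` with the zlib window pre-filled from `dictionary`."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress; the very same dictionary has to be supplied again."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()
```

Each item stays independently decompressible, so fast access to individual files is kept, as long as the shared dictionary is available on both sides.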

Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++ but when I want to get data out of the file I need to extract the whole thing. Does anyone know a compression component (preferably .NET compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original File      Compressed File
10-27          =>  2-5
100-202        =>  10-19
..............
10230-102020   =>  217-298
Since I know the data I need occurs only between positions 10-27 of the original file, I'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?

I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem (at least I think so) in a project I am working on, where I had to keep many text articles on disk and access them in a fairly random manner, and because of the size of the data I had to compress them.
The problem with compressing all this data at once is that most algorithms depend on previous data when decompressing. For example, the popular LZW method builds a dictionary (an instruction on how to decompress the data) on the fly while decompressing, so starting decompression in the middle of the stream is not possible, although I believe those methods might be tuned for it.
The solution I have found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project it was simple: each article was one chunk, and I compressed them one by one, then created an index file that recorded where each "chunk" starts. Decompressing was easy in that case: just decompress the whole stream for the one article that I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such an elegant manner, you can always split it into chunks artificially, for example by packing the data in 5MB chunks. Then, when you need to read data from 7MB to 13MB, you just decompress chunks 5-10 and 10-15.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives a slightly worse compression ratio. The smaller the chunks are, the worse the compression will be.
Also: having many chunks of data doesn't mean you have to have many files on the hard drive; just pack them one after another in one file and remember where they start.
Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files, written in C#. It worked pretty well for me, and you can use its built-in ability to store many files in one zip file to take care of splitting the data into chunks.
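The fixed-size chunk scheme can be sketched in a few lines. This is only an illustration in Python with zlib (the question asks for C++/.NET, where the same layout applies); pack, read_range and CHUNK are names invented for this example, and the index is kept in memory rather than written to a file.

```python
import zlib

CHUNK = 5 * 1024 * 1024  # 5MB of original data per chunk, as in the example above


def pack(data: bytes, chunk: int = CHUNK):
    """Return (index, blob); index[i] is the offset in blob of compressed chunk i."""
    index, parts, pos = [], [], 0
    for off in range(0, len(data), chunk):
        comp = zlib.compress(data[off:off + chunk])  # every chunk is independent
        index.append(pos)
        parts.append(comp)
        pos += len(comp)
    index.append(pos)  # sentinel entry: end of the last chunk
    return index, b"".join(parts)


def read_range(index, blob, start, end, chunk=CHUNK):
    """Return original bytes [start, end), decompressing only the overlapping chunks."""
    if start >= end:
        return b""
    first, last = start // chunk, (end - 1) // chunk
    out = bytearray()
    for i in range(first, min(last, len(index) - 2) + 1):
        out += zlib.decompress(blob[index[i]:index[i + 1]])
    base = first * chunk  # original offset of the first decompressed byte
    return bytes(out[start - base:end - base])
```

The original-to-compressed position map asked for in the question is exactly the index list: entry i says where the compressed form of original bytes [i * chunk, (i + 1) * chunk) begins.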

Multi-part gzip file random access (in Java)

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.
I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix Arc files. (In case you aren't familiar with multi-part gzip files, the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. They do not share any dictionary information; it is simple binary appending.)
I'm thinking it should be possible to do this by seeking to a certain offset within the file, then scan for the gzip magic header bytes (i.e. 0x1f8b, as per the RFC), and attempt to read the gzip stream from the following bytes. The problem with this approach is that those same bytes can appear inside the actual data as well, so seeking for those bytes can lead to an invalid position to start reading a gzip stream from. Is there a better way to handle random access, given that the record offsets aren't known a priori?

The BGZF file format, which is compatible with GZIP, was developed by biologists.
(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java

The design of GZIP, as you have realized, is not friendly to random access.
You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
If the files are not so big, you might consider just de-compressing all of the entries in series, and retaining the offsets of the signatures so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in 2s on a modern CPU. This is what I mean by "reasonably fast". But only you know the perf requirements of your application.
Do you have a GZipInputStream class? If so you are half-way there.
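The "decompress everything once and remember the offsets" approach can be sketched as follows. The question is about Java; this is a Python illustration using the standard zlib module, which reports the bytes left over after a member ends. The names member_offsets and read_member are made up, and slicing the buffer for every member is wasteful for big files (a real implementation would feed the input in pieces).

```python
import gzip
import zlib


def member_offsets(blob: bytes) -> list:
    """Return the start offset of every gzip member in `blob`."""
    offsets, pos = [], 0
    while pos < len(blob):
        d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
        d.decompress(blob[pos:])    # output is thrown away (the "bit bucket")
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        offsets.append(pos)
        pos = len(blob) - len(d.unused_data)  # the next member starts here
    return offsets


def read_member(blob: bytes, offset: int) -> bytes:
    """Decompress exactly one member, starting at a known offset."""
    d = zlib.decompressobj(31)
    return d.decompress(blob[offset:])
```

Because the offsets come from the decompressor itself rather than from scanning for 0x1f8b, magic bytes that happen to occur inside the compressed data cannot produce false positives.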

How to concat two or more gzip files/streams

I want to concatenate two or more gzip streams without recompressing them.
I mean, I have A compressed to A.gz and B to B.gz, and I want to combine them into a single gzip (A+B).gz without compressing again, using C or C++.
Several notes:
Even though you can simply concatenate two files and gunzip will know how to deal with them, most programs cannot deal with two chunks.
I once saw example code that does this by only decompressing the files and then manipulating the originals; this is significantly faster than normal recompression, but still requires O(n) CPU operations.
Unfortunately I can't find that example again (concatenation using decompression only); if someone can point me to it I would be grateful.
Note: this is not a duplicate of this, because the proposed solution does not fit my needs.
Clarification edit:
I want to concatenate several compressed HTML pieces and send them to the browser as one page, as per the request "Accept-Encoding: gzip", with the response "Content-Encoding: gzip".
If the streams are concatenated simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything; and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it is not decompressed at all).
Only Opera handles this well.
So I need to create a single gzip stream from several chunks and send it without recompressing.
Update: I found gzjoin.c in the zlib examples; it does this using only decompression. The problem is that decompression is still slower than a simple memcpy.
It is still 4 times faster than the fastest gzip compression, but that is not enough.
What I need is to find out what data I have to save alongside each gzip file so that I do not have to run the decompression procedure, and how to obtain that data during compression.

Look at RFC 1951 and RFC 1952.
The format is simply a series of members, each composed of three parts: a header, data and a trailer. The data part is itself a set of chunks, each with its own header and data part.
To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-chunk flag, for instance) and the trailer correctly, and copy the data parts.
There is a problem: the trailer has a CRC32 of the uncompressed data, and I'm not sure whether it is easy to compute when you only know the CRCs of the parts.
Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things which need the decompression.
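If you control the compression step, one way around the surgery is to prepare each piece so that it can be glued later: compress it as raw deflate, end it with a sync flush instead of a final block, and store its CRC-32 and length next to it. Joining is then plain copying plus a new header and trailer. The sketch below is in Python for brevity (the same zlib calls exist in C); prepare, crc32_concat and join are names made up for this example, and this is not what gzjoin.c does with arbitrary existing .gz files.

```python
import struct
import zlib

GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"  # no name, no mtime, OS unknown
FINAL_EMPTY_BLOCK = b"\x03\x00"  # empty fixed-Huffman block with the last-block bit set


def prepare(data: bytes):
    """Compress once; return (raw deflate data without a final block, crc32, length)."""
    c = zlib.compressobj(9, zlib.DEFLATED, -15)           # -15: raw deflate, no header/trailer
    body = c.compress(data) + c.flush(zlib.Z_SYNC_FLUSH)  # byte-aligned, stream left open
    return body, zlib.crc32(data), len(data)


def crc32_concat(crc_a: int, crc_b: int, len_b: int) -> int:
    """CRC-32 of A+B from the CRCs of A and B, by running the CRC over len_b zero bytes."""
    zeros = bytes(len_b)
    return zlib.crc32(zeros, crc_a) ^ crc_b ^ zlib.crc32(zeros)


def join(pieces) -> bytes:
    """Build one gzip member from prepared pieces without decompressing anything."""
    crc, size, bodies = 0, 0, []
    for body, piece_crc, piece_len in pieces:
        crc = crc32_concat(crc, piece_crc, piece_len)
        size += piece_len
        bodies.append(body)
    trailer = struct.pack("<II", crc, size & 0xFFFFFFFF)  # CRC-32, then ISIZE
    return GZIP_HEADER + b"".join(bodies) + FINAL_EMPTY_BLOCK + trailer
```

The CRC combination here still does work proportional to the piece length (on zeros, not on the data); zlib's C API has crc32_combine, which gives the same result much faster. Since each piece is compressed on its own, the ratio is a little worse than compressing the whole page at once.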

The gzip manual says that two gzip files can be concatenated as you attempted.
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
So it appears that the other tools may be broken, as seen in this bug report:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263
Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.
As others have mentioned you may be able to perform surgery:
http://www.gzip.org/zlib/rfc-gzip.html
And this requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can be easily calculated by adding the lengths of the individual sub-files.
At the bottom of the last link, there is code for calculating a running CRC-32, named update_crc.
Calculating the CRC on the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
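A small sketch of the trailer computation described here, using the running form of zlib.crc32 (the Python counterpart of update_crc). The function name trailer_for is made up for this example; it assumes you still have the uncompressed pieces at hand.

```python
import gzip
import struct
import zlib


def trailer_for(parts) -> bytes:
    """gzip trailer (CRC-32, then ISIZE, both little-endian) for the concatenated parts."""
    crc, size = 0, 0
    for part in parts:
        crc = zlib.crc32(part, crc)  # running CRC: continue from the previous value
        size += len(part)            # ISIZE is just the sum of the lengths
    return struct.pack("<II", crc, size & 0xFFFFFFFF)
```

The result matches the last eight bytes that gzip itself writes when compressing the concatenated data in one go.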

It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.
Please note that (1) the gzjoin.c approach is highly likely to be the best answer that you could get to your question as stated (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subject to extensive stress testing.
Please consider a boring understandable reliable approach: storing the original pieces UNcompressed, then select required pieces, and concatenate and compress them. Note that the compression ratio may be better than that obtained by glueing together small compressed pieces.

If tarring them is not out of the question (since the linked cat solution isn't viable for you):
tar cf A_B.gz.tar A.gz B.gz
Then, to get them back:
tar xf A_B.gz.tar
