Multi-part gzip file random access (in Java)


This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.
I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix ARC files. (In case you aren't familiar with multi-part gzip files: the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. The streams do not share any dictionary information; it is simple binary appending.)
I'm thinking it should be possible to do this by seeking to a certain offset within the file, then scanning for the gzip magic header bytes (i.e. 0x1f 0x8b, as per the RFC), and attempting to read a gzip stream starting at those bytes. The problem with this approach is that the same bytes can also appear inside the compressed data, so scanning for them can land on a position that is not a valid start of a gzip stream. Is there a better way to handle random access, given that the record offsets aren't known a priori?

The BGZF file format, which is compatible with GZIP, was developed by biologists.
(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java
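To make the BGZF idea concrete, here is a rough sketch (in Python, not the Picard classes) of how I understand the block layout: each block is an ordinary gzip member whose header carries an extra subfield "BC" holding the total block size minus one, so a reader can hop from block to block without inflating anything. The helper names make_block, block_offsets and read_block are made up for illustration, and real BGZF files also end with an empty marker block that is left out here.

```python
import gzip
import struct
import zlib

def make_block(payload: bytes) -> bytes:
    """Build one gzip member carrying a BGZF-style 'BC' extra subfield."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # raw deflate body
    body = comp.compress(payload) + comp.flush()
    total = 12 + 6 + len(body) + 8                   # header + extra + body + trailer
    if total - 1 > 0xFFFF:
        raise ValueError("payload too large for one block")
    # ID1, ID2, CM=8, FLG=4 (FEXTRA), MTIME, XFL, OS=255 (unknown), XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + body + trailer

def block_offsets(data: bytes):
    """List the start offset of every block by reading only the headers."""
    offsets, pos = [], 0
    while pos < len(data):
        xlen = struct.unpack_from("<H", data, pos + 10)[0]
        extra = data[pos + 12:pos + 12 + xlen]
        bsize, i = None, 0
        while i + 4 <= len(extra):                   # walk the extra subfields
            si1, si2, slen = struct.unpack_from("<BBH", extra, i)
            if (si1, si2) == (ord("B"), ord("C")) and slen == 2:
                bsize = struct.unpack_from("<H", extra, i + 4)[0]
            i += 4 + slen
        if bsize is None:
            raise ValueError(f"no BC subfield in block at offset {pos}")
        offsets.append(pos)
        pos += bsize + 1                             # BSIZE is block size minus one
    return offsets

def read_block(data: bytes, offsets, n: int) -> bytes:
    """Inflate block n only; 31 tells zlib to expect a gzip wrapper."""
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return zlib.decompress(data[offsets[n]:end], 31)

records = [b"record one\n" * 20, b"record two\n" * 20, b"record three\n" * 20]
blob = b"".join(make_block(r) for r in records)
offsets = block_offsets(blob)
```

Because every block is a valid gzip member, a plain gzip reader still sees the whole thing as one multi-member file; only a BGZF-aware reader uses the size field to skip around.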

The design of GZIP, as you have realized, is not friendly to random access.
You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
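A minimal sketch of that scan-and-verify loop, in Python only for brevity (java.util.zip offers comparable inflater operations). The function name next_member is made up, and the whole file is assumed to fit in memory. It relies on zlib checking the CRC32 and length in the gzip trailer before it reports end of stream.

```python
import gzip
import zlib

MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=8 (deflate)

def next_member(data: bytes, start: int):
    """Return (offset, payload) of the first valid gzip member at or after start."""
    pos = data.find(MAGIC, start)
    while pos != -1:
        inflater = zlib.decompressobj(31)    # 31 = expect a gzip wrapper
        try:
            payload = inflater.decompress(data[pos:])
        except zlib.error:
            payload = None                   # false positive inside compressed data
        if payload is not None and inflater.eof:
            return pos, payload              # trailer CRC32 and length checked out
        pos = data.find(MAGIC, pos + 1)
    return None

records = [b"alpha " * 100, b"beta " * 100, b"gamma " * 100]
members = [gzip.compress(r) for r in records]
blob = b"".join(members)

# Start one byte into the first member: the scan should land on the second.
found = next_member(blob, 1)
```

A candidate position is accepted only if a complete member decodes from it, so stray magic bytes cost time but should not produce wrong records.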
If the files are not so big, you might consider just de-compressing all of the entries in series, and retaining the offsets of the signatures so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in 2s on a modern CPU. That is what I mean by "reasonably fast", but only you know the perf requirements of your application.
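The directory-building pass could look roughly like this. It is a sketch that assumes the file fits in memory and contains nothing but back-to-back gzip members; build_index and read_record are illustrative names. The leftover bytes reported by the decompressor tell you where the next member starts.

```python
import gzip
import zlib

def build_index(data: bytes):
    """Return [(offset, compressed_len, uncompressed_len)], one entry per member."""
    index = []
    pos = 0
    while pos < len(data):
        inflater = zlib.decompressobj(31)        # 31 = expect a gzip wrapper
        payload = inflater.decompress(data[pos:])
        if not inflater.eof:
            raise ValueError(f"truncated gzip member at offset {pos}")
        # Whatever the decompressor did not consume belongs to later members.
        consumed = len(data) - pos - len(inflater.unused_data)
        index.append((pos, consumed, len(payload)))
        pos += consumed
    return index

def read_record(data: bytes, index, n: int) -> bytes:
    """Random access: inflate only member n, using the saved offsets."""
    offset, length, _ = index[n]
    return zlib.decompress(data[offset:offset + length], 31)

records = [b"alpha " * 100, b"beta " * 200, b"gamma " * 300]
blob = b"".join(gzip.compress(r) for r in records)
index = build_index(blob)
```

The slicing copies the remainder of the buffer on every iteration, which is fine for small files; for large ones you would feed the decompressor in chunks from the file instead. Here the payload is simply discarded, but you could pull a filename or date out of it to key the directory.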
Do you have a GZipInputStream class? If so you are half-way there.

Related

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?

Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.
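As background for why gzip did not help: deflate can only refer back 32 KiB, so a repeat that sits further away than that is invisible to it. The small experiment below illustrates this with random bytes standing in for the shared content; it uses Python's lzma module purely as an example of a compressor with a larger window (several MiB by default). That window is still far smaller than a 1.7G file, which is why diffing the files against each other remains the more promising route.

```python
import lzma
import os
import zlib

block = os.urandom(200_000)   # stand-in for content shared between two files
twice = block + block         # the second copy starts 200 kB after the first

# Deflate's 32 KiB window cannot reach back to the first copy.
deflate_size = len(zlib.compress(twice, 9))

# LZMA's default window is large enough to see the repeat at this distance.
lzma_size = len(lzma.compress(twice))

print(f"input {len(twice)}  deflate {deflate_size}  lzma {lzma_size}")
```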

Could gzip compression cause data corruption?

I'm trying to come up with a solution to compress a few petabytes of data that will be stored in AWS S3. I was thinking of using gzip compression and was wondering if compression could corrupt data. I tried searching but was not able to find any specific instances where gzip compression actually corrupted the data such that it was no longer recoverable.
I'm not sure if this is the correct forum for such a question, but do I need to verify that the data was correctly compressed? Also, any specific examples/data points would help.

I would not recommend using gzip directly on a large block of data in one shot.
Many times I have compressed entire drives using something similar to
dd if=/dev/sda conv=sync,noerror | gzip > /media/backup/sda.gz
and the data was unusable when I tried to restore it. I have reverted to not using compression.

gzip is constantly being used all around the world and has gathered a very strong reputation for reliability. But no software is perfect. Nor is any hardware, nor is S3. Whether you need to verify the data ultimately depends on your needs, but I think a hard disk failure is more likely than a gzip corruption at this point.

GZIP compression, like just about any other commonly-used data compression algorithm, is lossless. That means when you decompress the compressed data, you get back an exact copy of the original (and not something kinda sorta maybe like it, like JPEG does for images or MP3 for audio).
As long as you use a well-known program (like, say, gzip) to do the compression, are running on reliable hardware, and don't have malware on your machine, the chances of compression introducing data corruption are basically nil.

If you care about this data, then I would recommend compressing it, and then comparing the decompression of that with the original before deleting the original. This checks for a bunch of possible problems, such as memory errors, mass storage errors, cpu errors, transmission errors, as well as the least likely of all of these, a gzip bug.
Something like gzip -dc < petabytes.gz | cmp - petabytes in Unix would be a way to do it without having to store the original data again.
Also if loss of some of the data would still leave much of the remaining data useful, I would break it up into pieces so that if one part is lost, the rest is recoverable. Any part of a gzip file requires all of what precedes it to be available and correct in order to decompress that part.
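If you would rather do the comparison from a script than from the shell, the same check can be sketched as a streaming compare that never holds more than one chunk in memory. The function name matches_original and the file names are made up for the demo; any decoding error is treated as a failed verification.

```python
import gzip
import os
import shutil
import tempfile
import zlib

def matches_original(gz_path, original_path, chunk_size=1 << 20):
    """Stream-compare the decompressed .gz against the original file."""
    try:
        with gzip.open(gz_path, "rb") as packed, open(original_path, "rb") as plain:
            while True:
                chunk = packed.read(chunk_size)
                if not chunk:
                    return plain.read(1) == b""   # both sides must end together
                if plain.read(len(chunk)) != chunk:
                    return False
    except (OSError, EOFError, zlib.error):
        return False                              # bad CRC, truncation, corrupt data

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "data.bin")
packed = original + ".gz"
with open(original, "wb") as f:
    f.write(os.urandom(1000) * 300)
with open(original, "rb") as src, gzip.open(packed, "wb") as dst:
    shutil.copyfileobj(src, dst)

ok = matches_original(packed, original)

# Simulate a damaged archive by cutting off its last 20 bytes.
truncated = packed + ".part"
with open(packed, "rb") as src, open(truncated, "wb") as dst:
    dst.write(src.read()[:-20])
bad = matches_original(truncated, original)
```

Running this per piece also fits the advice about splitting the data: each piece gets its own archive and its own verification.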

Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically "lossless" compress every single file saved on a computer.
What are the pros? The cons? Is it feasible?

Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (for example, Huffman coding). To access the data you have to decompress it, which costs time and memory, because to access a specific piece of the file you generally have to decompress it as a whole. While decompressing you have to save the results somewhere, and the most appropriate place is RAM.
Decompressing the whole file wouldn't be a great problem if all of it needed to be read anyway, nor when a stream is reading it sequentially. But if a program wanted to write to the compressed file, all of the data, or at least part of it, would have to be compressed again.
As you can see, compressing files in the filesystem would considerably reduce the bandwidth available to applications (to read a single byte you have to read a chunk of the file and decompress it) and would also require more RAM.

How to compress ascii text without overhead

I want to compress a small text (400 bytes) and decompress it on the other side. If I do it with a standard compressor like rar or zip, it writes metadata along with the compressed data, and the result is bigger than the file itself.
Is there a way to compress the file without this metadata and open it on the other side with parameters known ahead of time?

You can do raw deflate compression with zlib. That avoids even the six-byte header and trailer of the zlib format.
However you will find that you still won't get much compression, if any at all, with just 400 bytes of input. Compression algorithms need much more history than that to get rolling, in order to build statistics and find redundancy in the data.
You should consider either a dictionary approach, where you build a dictionary of representative strings to provide the compressor something to work with, or you can consider a sequence of these 400-byte strings to be a single stream that is decompressed as a stream on the other end.
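Here is a small sketch of both suggestions through Python's zlib binding: negative window bits select raw deflate, and zdict supplies a preset dictionary that both sides must agree on in advance. The pack/unpack helpers, the sample message and the dictionary are all invented for illustration; a real dictionary would be built from strings that are typical of your messages.

```python
import zlib

def pack(text: bytes, dictionary: bytes = b"") -> bytes:
    """Raw deflate (wbits=-15): no zlib header, no Adler-32 trailer."""
    if dictionary:
        comp = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=dictionary)
    else:
        comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(text) + comp.flush()

def unpack(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Inverse of pack; the receiver must use the same dictionary."""
    if dictionary:
        decomp = zlib.decompressobj(-15, zdict=dictionary)
    else:
        decomp = zlib.decompressobj(-15)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical message and shared dictionary of strings expected in messages.
message = b'{"user": "alice", "status": "active", "email": "alice@example.com"}'
shared = b'{"user": "", "status": "active", "email": "@example.com"}'

zlib_wrapped = zlib.compress(message, 9)
raw = pack(message)
with_dict = pack(message, shared)

print(len(message), len(zlib_wrapped), len(raw), len(with_dict))
```

On input this short, dropping the wrapper saves only a few bytes; most of the gain, if any, comes from the dictionary giving the compressor history to match against.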

You can have a look at compression using Huffman codes. For examples, look here and here.

How to concat two or more gzip files/streams

I want to concat two or more gzip streams without recompressing them.
I mean I have A compressed to A.gz and B to B.gz, I want to compress them to single gzip (A+B).gz without compressing once again, using C or C++.
Several notes:
Even though you can just concatenate two files and gunzip will know how to deal with them, most programs are not able to deal with two chunks.
I once saw example code that does this by only decompressing the files and then manipulating the original streams, which is significantly faster than normal recompression but still requires O(n) CPU operations.
Unfortunately I can't find that example again (concatenation using decompression only); if someone can point me to it I would be grateful.
Note: this is not a duplicate of this because the proposed solution does not fit my needs.
Clarification edit:
I want to concatenate several compressed HTML pieces and send them to the browser as one page, as per the request "Accept-Encoding: gzip", with the response "Content-Encoding: gzip".
If the streams are concatenated as simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything; and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it is not decompressed at all).
Only Opera handles this well.
So I need to create a single gzip stream out of several chunks and send it without recompressing.
Update: I found gzjoin.c in the zlib examples; it does this using only decompression. The problem is that decompression is still slower than a simple memcpy.
It is still 4 times faster than the fastest gzip compression, but that is not enough.
What I need to find out is what data I must save together with each gzip file so that I don't have to run the decompression procedure, and how to obtain this data during compression.

Look at RFC 1951 and RFC 1952.
The format is simply a sequence of members, each composed of three parts: a header, data, and a trailer. The data part is itself a set of chunks (deflate blocks), each chunk having a header and a data part.
To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-chunk flag, for instance) and the trailer correctly, and copy the data parts.
There is a problem: the trailer has a CRC32 of the uncompressed data, and I'm not sure whether it is easy to compute when you only know the CRCs of the parts.
Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things which need the decompression.
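To see the header and trailer described above, you can pick a member apart with a few struct calls. This is only a sketch (in Python for brevity) of the fixed 10-byte header and the 8-byte trailer from RFC 1952; describe_member is a made-up name, and optional header fields such as the file name are not parsed.

```python
import gzip
import struct
import zlib

def describe_member(member: bytes) -> dict:
    """Read the fixed header fields and the trailer of one gzip member."""
    id1, id2, method, flags, mtime, xfl, os_code = struct.unpack_from(
        "<BBBBIBB", member, 0)
    # Trailer: CRC32 of the uncompressed data, then its length modulo 2**32.
    crc32, isize = struct.unpack_from("<II", member, len(member) - 8)
    return {"magic": (id1, id2), "method": method, "flags": flags,
            "crc32": crc32, "isize": isize}

payload = b"<p>hello</p>" * 50
info = describe_member(gzip.compress(payload))
```

Both trailer fields describe the uncompressed data, which is why joining two members into one cannot be done by copying bytes alone.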

The gzip manual says that two gzip files can be concatenated as you attempted.
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
So it appears that the other tools may be broken, as seen in this bug report:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263
Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.
As others have mentioned you may be able to perform surgery:
http://www.gzip.org/zlib/rfc-gzip.html
And this requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can be easily calculated by adding the lengths of the individual sub-files.
In the bottom of the last link, there is code for calculating a running crc-32 named update_crc.
Calculating the CRC on the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
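The running CRC and the summed length can be sketched like this, using zlib's crc32 (which accepts the previous value as a starting point) in place of the RFC's update_crc. The function name combined_trailer is invented, and it assumes you still have the uncompressed pieces at hand.

```python
import gzip
import struct
import zlib

def combined_trailer(parts) -> bytes:
    """Build the 8-byte gzip trailer for the concatenation of the given pieces."""
    crc = 0
    size = 0
    for part in parts:
        crc = zlib.crc32(part, crc)              # continue from the running value
        size = (size + len(part)) & 0xFFFFFFFF   # ISIZE is stored modulo 2**32
    return struct.pack("<II", crc, size)

pieces = [b"<html><body>", b"<p>one</p>" * 40, b"</body></html>"]
trailer = combined_trailer(pieces)

# For comparison: the trailer gzip itself writes for the joined data.
reference = gzip.compress(b"".join(pieces))[-8:]
```

This only covers the trailer; the deflate data of the individual pieces still has to be stitched together, which is the part gzjoin.c handles.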

It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.
Please note that (1) the gzjoin.c approach is highly likely to be the best answer that you could get to your question as stated (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subject to extensive stress testing.
Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them, and compress the result. Note that the compression ratio may be better than that obtained by gluing together small compressed pieces.

If tarring them is not out of the question (since the linked cat solution isn't viable for you):
tar cf A_B.gz.tar A.gz B.gz
Then, to get them back:
tar xf A_B.gz.tar
