I'd like to be able to generate a gzip (.gz) file using concurrent CPU threads. That is, I would deflate separate chunks of the input file with separately initialized z_stream records.
The resulting file should be readable by zlib's inflate() function in a classic single threaded operation.
Is that possible, even if it requires customized zlib code? The only requirement is that the currently existing zlib inflate code can handle the result.
Update
The pigz source code demonstrates how it works. It uses some sophisticated optimizations to share the dictionary between chunks, keeping the compression rate optimal. It further handles bit packing if a more recent zlib version is used.
However, I'd like to understand how to roll my own, keeping things simple, without the optimizations pigz uses.
And while many consider source code to be the ultimate documentation (Ed Post, anyone?), I'd rather have it explained in plain words to avoid misunderstandings. (While the docs actually describe what happens pretty well, they do not explain too well what needs to be done to roll one's own.)
From browsing the code, I figured out this much so far:
It appears that one simply creates each compressed chunk using deflate(..., Z_SYNC_FLUSH) instead of Z_FINISH. However, deflateEnd() then reports an error, and I'm not sure whether that can be ignored. One also needs to calculate the final checksum over all chunks manually, though I wonder how to add the checksum at the end. There is also a rather complex put_trailer() function for writing the gzip trailer - I wonder if that could also be handled by zlib's own code for simple cases?
Any clarification on this is appreciated.
Also, I realize that I should have included asking about writing a zlib stream the same way, in order to write multithreaded-compressed files to a zip archive. There, I suspect, more simplifications are possible due to the lack of the more complex gzip header.
The answer is in your question. Each thread has its own deflate instance to produce raw deflate data (see deflateInit2()), which compresses the chunk of the data fed to it, ending with Z_SYNC_FLUSH instead of Z_FINISH. Except for the last chunk of data, which you end with a Z_FINISH. Either way, this ends each resulting stream of compressed data on a byte boundary. Make sure that you get all of the generated data out of deflate(). Then you can concatenate all the compressed data streams. (In the correct order!) Precede that with a gzip header that you generate yourself. It is trivial to do that (see RFC 1952). It can just be a constant 10-byte sequence if you don't need any additional information included in the header (e.g. file name, modification date). The gzip header is not complex.
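For illustration, here is a rough per-thread sketch of the above (my own illustrative names; error handling and the general "keep calling deflate() until avail_out comes back nonzero" loop are trimmed, since the output buffer is sized with deflateBound()):

#include <zlib.h>
#include <cstddef>
#include <cstring>
#include <vector>

// A minimal fixed gzip header (RFC 1952): magic bytes, CM = deflate, no flags,
// zero modification time, no extra flags, OS = 255 (unknown).
static const unsigned char gzip_header[10] =
    { 0x1f, 0x8b, 0x08, 0, 0, 0, 0, 0, 0, 0xff };

// Hypothetical per-thread helper: compress one chunk to a raw deflate stream.
// 'last' selects Z_FINISH for the final chunk, Z_SYNC_FLUSH for all others.
std::vector<unsigned char> deflate_chunk(const unsigned char* in, std::size_t len, bool last)
{
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    // windowBits = -15 requests a raw deflate stream (no zlib or gzip wrapper).
    deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

    // deflateBound() plus a little slack for the sync-flush marker.
    std::vector<unsigned char> out(deflateBound(&strm, len) + 16);
    strm.next_in   = const_cast<unsigned char*>(in);
    strm.avail_in  = static_cast<uInt>(len);
    strm.next_out  = out.data();
    strm.avail_out = static_cast<uInt>(out.size());

    deflate(&strm, last ? Z_FINISH : Z_SYNC_FLUSH);
    out.resize(out.size() - strm.avail_out);

    deflateEnd(&strm);   // may report an error after Z_SYNC_FLUSH; it can be ignored
    return out;
}

Write gzip_header once up front, then the chunks' compressed outputs in their original order, then the trailer described next.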
You can also compute the CRC-32 of each uncompressed chunk in the same thread or a different thread, and combine those CRC-32's using crc32_combine(). You need that for the gzip trailer.
After all of the compressed streams are written, ending with the compressed stream that was ended with a Z_FINISH, you append the gzip trailer. That is just the four-byte CRC-32 followed by the low four bytes of the total uncompressed length, both in little-endian order - eight bytes total.
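A sketch of that trailer step, assuming each chunk's CRC-32 and uncompressed length have been collected (names are illustrative):

#include <zlib.h>
#include <cstddef>
#include <cstdio>

// Each chunk's CRC-32 is computed elsewhere as
// crc32(crc32(0L, Z_NULL, 0), chunk_data, chunk_len); here the per-chunk
// values are combined in order and the 8-byte gzip trailer is written.
void write_gzip_trailer(std::FILE* out, const uLong* chunk_crcs,
                        const std::size_t* chunk_lens, std::size_t nchunks)
{
    uLong crc = crc32(0L, Z_NULL, 0);            // CRC-32 of the empty string
    unsigned long long total = 0;
    for (std::size_t i = 0; i < nchunks; i++) {
        crc = crc32_combine(crc, chunk_crcs[i], (z_off_t)chunk_lens[i]);
        total += chunk_lens[i];
    }

    unsigned char trailer[8];
    for (int i = 0; i < 4; i++) {
        trailer[i]     = (unsigned char)(crc   >> (8 * i));   // CRC-32, little-endian
        trailer[4 + i] = (unsigned char)(total >> (8 * i));   // ISIZE = length mod 2^32
    }
    std::fwrite(trailer, 1, sizeof(trailer), out);
}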
In each thread you can either use deflateEnd() when done with each chunk, or if you are reusing threads for more chunks, use deflateReset(). I found in pigz that it is much more efficient to leave threads open and deflate instances open in them when processing multiple chunks. Just make sure to use deflateEnd() for the last chunk that thread processes, before closing the thread. Yes, the error from deflateEnd() can be ignored. Just make sure that you've run deflate() until avail_out is not zero to get all of the compressed data.
Doing this, each thread compresses its chunk with no reference to any other uncompressed data, where such references would normally improve the compression when doing it serially. If you want to get more advanced, you can feed each thread the chunk of uncompressed data to compress, and the last 32K of the previous chunk to provide history for the compressor. You do this with deflateSetDictionary().
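A small sketch of that priming step, with illustrative names:

#include <zlib.h>
#include <algorithm>
#include <cstddef>

// Before compressing a chunk, prime the freshly initialized raw-deflate
// stream with the last 32 KiB of the previous chunk. This must be called
// after deflateInit2() and before the first deflate() call on the chunk.
void prime_with_history(z_stream* strm,
                        const unsigned char* prev_chunk, std::size_t prev_len)
{
    const std::size_t window = 32768;             // size of the deflate window
    const std::size_t dict_len = std::min(prev_len, window);
    deflateSetDictionary(strm, prev_chunk + (prev_len - dict_len),
                         static_cast<uInt>(dict_len));
}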
Still more advanced, you can reduce the number of bytes inserted between compressed streams by sometimes using Z_PARTIAL_FLUSH until you get to a byte boundary. See pigz for the details of that.
Even more advanced, but slower, you can append compressed streams at the bit level instead of the byte level. That would require shifting every byte of the compressed stream twice to build a new shifted stream. At least for seven out of every eight preceding compressed streams. This eliminates all of the extra bits inserted between compressed streams.
A zlib stream can be generated in exactly the same way, using adler32_combine() for the checksum.
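A hedged sketch of the corresponding zlib trailer (the two-byte zlib header, e.g. 0x78 0x9C for default settings, replaces the gzip header up front; RFC 1950 stores the Adler-32 most-significant byte first):

#include <zlib.h>
#include <cstddef>
#include <cstdio>

// Counterpart to the gzip trailer above: combine per-chunk Adler-32 values
// (computed as adler32(adler32(0L, Z_NULL, 0), data, len)) and append the
// 4-byte zlib trailer in big-endian order.
void write_zlib_trailer(std::FILE* out, const uLong* chunk_adlers,
                        const std::size_t* chunk_lens, std::size_t nchunks)
{
    uLong adler = adler32(0L, Z_NULL, 0);        // Adler-32 of empty input
    for (std::size_t i = 0; i < nchunks; i++)
        adler = adler32_combine(adler, chunk_adlers[i], (z_off_t)chunk_lens[i]);

    unsigned char trailer[4];
    for (int i = 0; i < 4; i++)
        trailer[i] = (unsigned char)(adler >> (8 * (3 - i)));   // MSB first
    std::fwrite(trailer, 1, sizeof(trailer), out);
}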
Your question about zlib implies a confusion. The zip format does not use the zlib header and trailer. zip has its own structure, within which raw deflate streams are embedded. You can use the above approach for those raw deflate streams as well.
Sure:
http://zlib.net/pigz/
A parallel implementation of gzip for modern multi-processor, multi-core machines
Related
I have a large number of files, all in the same file format, some of which were gzipped to save space. I am curating the archive to eliminate duplicates.
For a significant number of the duplicate files (pairs of one gzipped, one regular), they differ by < 20 bytes, starting at one of a small number of file offsets (one offset being 313656 bytes from start of file; another far more common offset being 176287). Files are anywhere from 1MB to 200MB, uncompressed.
I believe Ubuntu Linux versions of gzip and/or 7zip command line utilities were used to compress the files. I cannot even be certain that the gzipped versions are the corrupt ones.
Does anyone know of a mechanism that would create such a specific pattern of corruption, which I can then (a) avoid in future and (b) hopefully use to choose the "correct" (most likely uncorrupted) version of the file?
When you decompress the gzip member of the pair, you are seeing a few bytes different from the already uncompressed other member of the pair? If so, then the next question is: did the gzip decompression work with no error message? If so, then the CRC-32 value at the end of the gzip file, as well as the uncompressed length, checked out as ok. In that case, the gzip file is the one you should keep.
I have no way of knowing or guessing what could have caused the corruption of the uncompressed files.
I have a block of data I want to compress, say, C structures of variable sizes. I want to compress the data, but access specific fields of structures on the fly in application code without having to decompress the entire data.
Is there an algorithm which can take the offset (for the original data), decompress and return the data?
Compression methods generally achieve compression by making use of the preceding data. At any point in the compressed data, you need to know at least some amount of the preceding uncompressed data in order to decompress what follows.
You can deliberately forget the history at select points in the compressed data in order to have random access at those points. This reduces the compression by some amount, but that can be small with sufficiently distant random access points. A simple approach would be to compress pieces with gzip and concatenate the gzip streams, keeping a record of the offsets of each stream. For less overhead, you can use Z_FULL_FLUSH in zlib to do the same thing.
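A hedged sketch of the Z_FULL_FLUSH variant, with illustrative names and minimal error handling - each fixed-size piece ends with a full flush so the compressor's history is reset there, and the index records where each piece starts in both streams:

#include <zlib.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

struct AccessPoint {
    long long uncompressed_offset;   // where this piece starts in the original data
    long long compressed_offset;     // where it starts in the compressed stream
};

std::vector<AccessPoint> compress_with_index(std::FILE* out,
                                             const unsigned char* data,
                                             std::size_t len)
{
    const std::size_t piece = 1 << 20;           // 1 MiB between access points
    std::vector<AccessPoint> index;
    std::vector<unsigned char> buf(1 << 18);

    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    deflateInit(&strm, Z_DEFAULT_COMPRESSION);

    long long out_offset = 0;
    for (std::size_t pos = 0; pos < len; pos += piece) {
        index.push_back(AccessPoint{ (long long)pos, out_offset });

        const std::size_t n = std::min(piece, len - pos);
        strm.next_in  = const_cast<unsigned char*>(data + pos);
        strm.avail_in = (uInt)n;
        const int flush = (pos + n == len) ? Z_FINISH : Z_FULL_FLUSH;
        do {                                     // drain all output for this piece
            strm.next_out  = buf.data();
            strm.avail_out = (uInt)buf.size();
            deflate(&strm, flush);
            const std::size_t produced = buf.size() - strm.avail_out;
            std::fwrite(buf.data(), 1, produced, out);
            out_offset += (long long)produced;
        } while (strm.avail_out == 0);
    }
    deflateEnd(&strm);
    return index;
}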
Alternatively, you can save the history at each random access point in a separate file. An example of building such a random access index to a zlib or gzip stream can be found in zran.c.
You can construct compression methods that do not depend on previous history for decompression, such as simple Huffman coding. However the compression ratio will be poor compared to methods that do depend on previous history.
Take a compressed file system as an example: there the filesystem API doesn't need to know about the compression that happens before data is written to disk. There are a few algorithms out there.
Check here for more details.
However, maybe there is more to gain from optimizing the data structures used, so that there is no need to compress them at all?
For efficient access an index is needed. So between arrays, multimaps and sparse arrays, there should be a way to model the data such that no further compression is needed, because the data is already represented efficiently.
Of course this depends largely on the use case which is quite ambiguous.
A use case where a compression layer is needed to access the data is possible to imagine, but it's likely that there are better ways to solve the problem.
I am using the STL fstream utilities to read from a file. However, what I would like to do is read a specified number of bytes and then seek back some bytes and read again from that position. So it is sort of an overlapped read. In code, this would look as follows:
#include <fstream>
#include <cstddef>

std::ifstream fileStream;
fileStream.open("file.txt", std::ios::in);

std::size_t read_num = 0;
std::size_t window_size = 200;                    // overlap between consecutive reads

while (read_num < total_num)                      // total_num: total bytes to process
{
    char buffer[1024];
    fileStream.read(buffer, sizeof(buffer));
    std::size_t num_bytes_read = fileStream.gcount();   // read() returns the stream, not a count
    if (num_bytes_read <= window_size)
        break;
    read_num += num_bytes_read - window_size;     // advance, minus the overlap window
    fileStream.seekg(read_num);
}
This is not the only way to solve my problem, but it will make multi-tasking a breeze (I have been looking at other data structures like circular buffers, but those would make multitasking difficult). I was wondering if I could have your input on how much of a performance hit these seek operations might take when processing very large files. I will only ever use one thread to read the data from the file.
The files contain long sequences of text with only characters from the set {A,D,C,G,F,T}. Would it also be advisable to open the file as a binary file rather than in text mode as I am doing?
Because the file is large, I am also reading it in chunks, with the chunk size set to a 32 MB block. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably macOS), the C++ streams are built on lower-level primitives (often system calls) such as read(2) and write(2). The implementation buffers the data (the standard C++ library typically calls read(2) on buffers of several kilobytes), and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most not-too-big files (e.g. files of a few hundred megabytes on a laptop with several gigabytes of RAM) stay in RAM for a while once they have been read or written. See also sync(2).
As commented by Hans Passant, reading from the middle of a textual file could be error-prone (in particular because a UTF-8 character may span several bytes) if not done very carefully.
Notice that, from a C (fopen) or C++ point of view, textual files and binary files differ notably in how they handle ends of lines.
If performance matters a lot to you, you could directly use low-level system calls like read(2), write(2) and lseek(2), but then be careful to use wide enough buffers (typically several kilobytes, e.g. 4 Kbytes to 512 Kbytes, or even several megabytes). Don't forget to use the returned read or written byte count (some I/O operations can be partial, or fail, etc.). Avoid, if possible (for performance reasons), repeatedly calling read(2) for only a dozen bytes. You could instead memory-map the file (or a segment of it) using mmap(2) (before mmap-ing, use stat(2) to get metadata, notably the file size). And you could give advice to the kernel using posix_fadvise(2) or (for a file mapped into virtual memory) madvise(2). Performance details are heavily system dependent (file system, hardware - SSDs and hard disks are different! - system load).
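For example, a minimal POSIX sketch of the mmap route (error handling trimmed): map the whole file read-only, advise the kernel about the access pattern, and read the bytes straight from memory:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

const char* map_whole_file(const char* path, std::size_t* out_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, (std::size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                   // the mapping survives the close
    if (p == MAP_FAILED) return nullptr;

    madvise(p, (std::size_t)st.st_size, MADV_SEQUENTIAL);  // we will scan it sequentially
    *out_size = (std::size_t)st.st_size;
    return (const char*)p;                       // release later with munmap()
}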
Finally, you could consider using some higher-level library for binary files, such as indexed files à la GDBM or the sqlite library, or consider using real databases such as PostgreSQL, MongoDB, etc.
Apparently, your files contain genomics information. Probably you don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps there already exist free software libraries to parse them. Otherwise, you might consider a two-pass approach: a first pass reads the entire file sequentially and remembers (in C++ containers like std::map) the interesting parts and their offsets; a second pass then uses direct access. You might even have some preprocessor convert your genomics files into SQLite or GDBM files, and have your application work on those. You probably should avoid opening these files as text (open them just as binary files) because end-of-line processing is useless to you.
On a 64-bit system, if you handle only a few files (not thousands of them at once) of several dozen gigabytes each, memory-mapping them with mmap should make sense; then use madvise (but on a 32-bit system, you won't be able to mmap an entire file that large).
Plausibly, yes. Whenever you seek, the cached file data for that file is (likely to be) discarded, causing the extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring performance of each method would be essential - it's only possible to KNOW these things for a given system by measuring, it's not something you can read about and get precise information on the internet (not even here on SO), because there are a whole bunch of factors that affect the behaviour.
I am looking for the fastest way to read a sequential file from disk.
I read in some posts that if I compressed the file using, for example, lz4, I could achieve better performance than reading the flat file, because I would minimize the I/O operations.
But when I try this approach, scanning an lz4-compressed file gives me poorer performance than scanning the flat file. I didn't try the lz4demo linked below, but looking at it, my code is very similar.
I have found these benchmarks:
http://skipperkongen.dk/2012/02/28/uncompressed-versus-compressed-read/
http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c?r=75
Is it really possible to improve performance reading a compressed sequential file over an uncompressed one? What am I doing wrong?
Yes, it is possible to improve disk read by using compression.
This effect is most likely to happen if you use a multi-threaded reader: while one thread reads compressed data from disk, the other one decodes the previous compressed block in memory.
Considering the speed of LZ4, the decoding operation is likely to finish before the other thread completes reading the next block. This way, you'll achieve a bandwidth improvement, proportional to the compression ratio of the tested file.
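A rough sketch of that overlap, assuming a homegrown container (not the official lz4 frame format) in which each block is stored as two 4-byte sizes in native byte order - compressed size and original size - followed by data produced with LZ4_compress_default(); while the current block is being decoded, the next one is already being read in a background task:

#include <lz4.h>
#include <cstdint>
#include <cstdio>
#include <future>
#include <vector>

struct Block {
    std::vector<char> compressed;
    std::uint32_t original_size = 0;
};

static Block read_block(std::FILE* in)
{
    Block b;
    std::uint32_t sizes[2];                                   // {compressed, original}
    if (std::fread(sizes, sizeof(sizes[0]), 2, in) != 2) return b;   // EOF: empty block
    b.compressed.resize(sizes[0]);
    b.original_size = sizes[1];
    std::fread(b.compressed.data(), 1, sizes[0], in);
    return b;
}

void scan_file(std::FILE* in)
{
    std::vector<char> out;
    Block cur = read_block(in);
    while (!cur.compressed.empty()) {
        // Start reading the next block while the current one is being decoded.
        std::future<Block> next = std::async(std::launch::async, read_block, in);

        out.resize(cur.original_size);
        LZ4_decompress_safe(cur.compressed.data(), out.data(),
                            (int)cur.compressed.size(), (int)out.size());
        // ... scan the uncompressed bytes in 'out' here ...

        cur = next.get();                        // wait for the overlapping read
    }
}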
Obviously, there are other effects to consider when benchmarking. For example, the seek times of an HDD are several orders of magnitude larger than those of an SSD, and under bad circumstances they can become the dominant part of the timing, reducing any bandwidth advantage to zero.
It depends on the speed of the disk vs. the speed and space savings of decompression. I'm sure you can put this into a formula.
Is it really possible to improve performance reading a compressed sequential file over an uncompressed one? What am I doing wrong?
Yes, it is possible (example: a 1kb zip file could contain 1GB of data - it would most likely be faster to read and decompress the ZIP).
Benchmark different algorithms and their decompression speeds. There are compression benchmark websites for that. There are also special-purpose high-speed compression algorithms.
You could also try to change the data format itself. Maybe switch to protobuf which might be faster and smaller than CSV.
What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithms would be out of the question.
And I know huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers, please assume the file in question is 100 GB, and sometimes I'll want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes in the middle.
I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"
A dictionary-based compression scheme, with each dictionary entry's code encoded with the same size, will let you begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
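A toy illustration of the fixed-size-code idea, under the simplifying (hypothetical) assumption that every 16-bit code expands to exactly the same number of bytes - byte offset O of the original data then lives in code number O / ENTRY_SIZE, and both reads and in-place updates stay trivial:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t ENTRY_SIZE = 8;                 // fixed expansion per code (assumed)
using Entry = std::array<unsigned char, ENTRY_SIZE>;

unsigned char read_byte(const std::vector<std::uint16_t>& codes,  // compressed data
                        const std::vector<Entry>& dictionary,     // code -> byte sequence
                        std::size_t offset)                       // offset in original data
{
    const std::uint16_t code = codes[offset / ENTRY_SIZE];        // which code covers 'offset'
    return dictionary[code][offset % ENTRY_SIZE];                 // byte within that entry
}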
I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
E.g., we'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store the compressed chunks sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk that offset is in (O / 8K), decompress that chunk, and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
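A sketch of that read path, with hypothetical names - the index records where each compressed chunk lives, and decompress_chunk() stands in for whatever codec is used:

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t CHUNK_SIZE = 8 * 1024;

struct ChunkInfo {
    std::size_t compressed_offset;   // where the compressed chunk is stored
    std::size_t compressed_size;     // how big it is
};

// Placeholder: decompress one chunk and return its uncompressed bytes.
std::vector<unsigned char> decompress_chunk(const ChunkInfo& info);

// Read 'count' bytes starting at uncompressed offset 'offset'.
std::vector<unsigned char> read_at(const std::vector<ChunkInfo>& index,
                                   std::size_t offset, std::size_t count)
{
    std::vector<unsigned char> result;
    while (count > 0) {
        const std::size_t chunk  = offset / CHUNK_SIZE;   // which chunk holds 'offset'
        const std::size_t within = offset % CHUNK_SIZE;   // position inside that chunk
        if (chunk >= index.size()) break;                 // past the end of the file

        const std::vector<unsigned char> data = decompress_chunk(index[chunk]);
        if (within >= data.size()) break;                 // offset past end of data
        const std::size_t n = std::min(count, data.size() - within);
        result.insert(result.end(), data.begin() + within, data.begin() + within + n);

        offset += n;                                      // may continue into the next chunk
        count  -= n;
    }
    return result;
}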
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.
Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
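A minimal sketch of that overlay idea - writes go into a sparse map keyed by file offset, and reads consult the map before falling back to the compressed file (read_compressed() is a stand-in for that fallback):

#include <cstddef>
#include <map>

struct OverlayFile {
    std::map<std::size_t, unsigned char> overlay;   // offset -> byte written after compression

    // Stand-in: decompress and return the byte at 'offset' from the compressed file.
    unsigned char read_compressed(std::size_t offset);

    void write_byte(std::size_t offset, unsigned char value) {
        overlay[offset] = value;                    // never touches the compressed data
    }

    unsigned char read_byte(std::size_t offset) {
        auto it = overlay.find(offset);
        return it != overlay.end() ? it->second : read_compressed(offset);
    }
};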