Decompress data on the fly from offsets in the original data?

I have a block of data I want to compress, say, C structures of variable sizes. I want to compress the data, but access specific fields of structures on the fly in application code without having to decompress the entire data.
Is there an algorithm which can take the offset (for the original data), decompress and return the data?

Compression methods generally achieve compression by making use of the preceding data. At any point in the compressed data, you need to know at least some amount of the preceding uncompressed data in order to decompress what follows.
You can deliberately forget the history at select points in the compressed data in order to have random access at those points. This reduces the compression by some amount, but that can be small with sufficiently distant random access points. A simple approach would be to compress pieces with gzip and concatenate the gzip streams, keeping a record of the offsets of each stream. For less overhead, you can use Z_FULL_FLUSH in zlib to do the same thing.
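As a rough illustration of the Z_FULL_FLUSH idea, here is a minimal Python sketch using the standard zlib module. The names compress_with_index and read_range, and the 64 KiB chunk size, are assumptions made up for this example; nothing in zlib prescribes them. It writes one raw deflate stream, records the compressed offset of each access point, and later decompresses only the pieces that cover the requested range.

```python
import zlib

CHUNK = 64 * 1024  # uncompressed bytes between random access points (arbitrary choice)

def compress_with_index(data, chunk=CHUNK):
    """Return (raw deflate stream, list of compressed offsets, one per chunk)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # wbits=-15 selects raw deflate
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk):
        offsets.append(len(out))                 # this chunk starts here in the output
        out += comp.compress(data[start:start + chunk])
        out += comp.flush(zlib.Z_FULL_FLUSH)     # byte-align and forget the history
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets

def read_range(blob, offsets, pos, n, chunk=CHUNK):
    """Read n bytes at uncompressed offset pos, inflating only the chunks needed."""
    out = bytearray()
    while n > 0 and pos // chunk < len(offsets):
        i = pos // chunk
        end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
        # A fresh inflater can start at any recorded offset because of the full flush.
        piece = zlib.decompressobj(-15).decompress(blob[offsets[i]:end])
        part = piece[pos % chunk: pos % chunk + n]
        if not part:
            break
        out += part
        pos += len(part)
        n -= len(part)
    return bytes(out)
```

The whole stream is still an ordinary raw deflate stream, so zlib.decompress(blob, -15) recovers everything in one pass. Smaller chunks mean less work per read but a worse compression ratio.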
Alternatively, you can save the history at each random access point in a separate file. An example of building such a random access index to a zlib or gzip stream can be found in zran.c.
You can construct compression methods that do not depend on previous history for decompression, such as simple Huffman coding. However the compression ratio will be poor compared to methods that do depend on previous history.

Example compressed file system: There we have a filesystem API which doesn't need to know about the compression that happens before it's written to disk. There are a few algorithms out there.
Check here for more details.
However, maybe there is more gain in trying to optimize the used data structures so there is no need to compress them?
For efficient access an index is needed. So between arrays, MultiMaps and sparse arrays, there should be a way to model the data so that no further compression is needed because the data is already represented efficiently.
Of course this depends largely on the use case which is quite ambiguous.
A use case where a compression layer is needed to access the data is possible to imagine, but it's likely that there are better ways to solve the problem.

Related

Compression Algorithms with Constant-Time Seek to Specific Byte?

I'm experimenting with building a data-structure optimized for a very specific use-case. Essentially, I am trying to build a compressed bitset of a constant size, and obviously for that use-case, two operations exist: read the value of a bit or write the value of a bit.
The best case scenario would be to be able to read a byte and write a byte in-place in constant time, but I can't imagine that it would be possible to write to an arbitrary byte without making changes to the rest of the compressed chunk of memory. However, it might be possible to read an arbitrary byte in an amount of time that tends toward O(1).
I have been reading Wikipedia articles, and I'm familiar with LZO, but is there a table somewhere which describes the various features and tradeoffs that various compression systems provide? I'd like a moderate level of compression, and I'm mainly wanting to optimize around memory holes, e.g. large gaps of bytes which are zeroes.
Assuming that you are doing many of these random accesses, you can build an index (once) to a compressed stream to get O(1). Here is an example for gzip streams.

Creating a gzip stream from separately compressed chunks

I'd like to be able to generate a gzip (.gz) file using concurrent CPU threads, i.e., I would be deflating separate chunks of the input file with separately initialized z_stream records.
The resulting file should be readable by zlib's inflate() function in a classic single threaded operation.
Is that possible? Even if it requires customized zlib code? The only requirement would be that the currently existing zlib's inflate code could handle it.
Update
The pigz source code demonstrates how it works. It uses some sophisticated optimizations to share the dictionary between chunks, keeping the compression rate optimal. It further handles bit packing if a more recent zlib version is used.
However, I'd like to understand how to roll my own, keeping things simple, without the optimizations pigz uses.
And while many consider source code to be the ultimate documentation (Ed Post, anyone?), I'd rather have it explained in plain words to avoid misunderstandings. (While the docs describe what happens pretty well, they do not explain too well what needs to be done to roll one's own.)
From browsing the code, I figured out this much so far:
It appears that one simply creates each compressed chunk using deflate(..., Z_SYNC_FLUSH) instead of using Z_FINISH. However, deflateEnd() gives an error then, not sure if that can be ignored. And one needs to calculate the final checksum over all chunks manually, though I wonder how to add the checksum at the end. There is also a rather complex put_trailer() function for writing a gzip header - I wonder if that could also be handled by zlib's own code for simple cases?
Any clarification on this is appreciated.
Also, I realize that I should have included asking about writing a zlib stream the same way, in order to write multithreaded-compressed files to a zip archive. There, I suspect, more simplifications are possible due to the lack of the more complex gzip header.
The answer is in your question. Each thread has its own deflate instance to produce raw deflate data (see deflateInit2()), which compresses the chunk of the data fed to it, ending with Z_SYNC_FLUSH instead of Z_FINISH. Except for the last chunk of data, which you end with a Z_FINISH. Either way, this ends each resulting stream of compressed data on a byte boundary. Make sure that you get all of the generated data out of deflate(). Then you can concatenate all the compressed data streams. (In the correct order!) Precede that with a gzip header that you generate yourself. It is trivial to do that (see RFC 1952). It can just be a constant 10-byte sequence if you don't need any additional information included in the header (e.g. file name, modification date). The gzip header is not complex.
You can also compute the CRC-32 of each uncompressed chunk in the same thread or a different thread, and combine those CRC-32's using crc32_combine(). You need that for the gzip trailer.
After all of the compressed streams are written, ending with the compressed stream that was ended with a Z_FINISH, you append the gzip trailer. The trailer is just the four-byte CRC-32 followed by the low four bytes of the total uncompressed length, both in little-endian order. Eight bytes total.
In each thread you can either use deflateEnd() when done with each chunk, or if you are reusing threads for more chunks, use deflateReset(). I found in pigz that it is much more efficient to leave threads open and deflate instances open in them when processing multiple chunks. Just make sure to use deflateEnd() for the last chunk that thread processes, before closing the thread. Yes, the error from deflateEnd() can be ignored. Just make sure that you've run deflate() until avail_out is not zero to get all of the compressed data.
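The steps above can be sketched in Python with the standard zlib module. This is only an illustration under a few assumptions: the function names, chunk size and worker count are invented for the example, and because Python's zlib module does not expose crc32_combine(), the CRC-32 is computed as a running value over the chunks in order rather than combined from per-thread results.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Constant 10-byte gzip header (RFC 1952): magic, CM=8 (deflate), FLG=0,
# MTIME=0, XFL=0, OS=255 (unknown).
GZIP_HEADER = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"

def deflate_chunk(chunk, last):
    """Compress one chunk as raw deflate with its own compressor instance."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    # Z_SYNC_FLUSH ends on a byte boundary without marking the final block;
    # only the last chunk is terminated with Z_FINISH.
    return comp.compress(chunk) + comp.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH)

def parallel_gzip(data, chunk_size=128 * 1024, workers=4):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    last_flags = [i == len(chunks) - 1 for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(deflate_chunk, chunks, last_flags))  # map keeps input order
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # running CRC-32 over the uncompressed data
    trailer = struct.pack("<II", crc & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)
    return GZIP_HEADER + b"".join(parts) + trailer
```

The result is readable by any ordinary single-threaded gzip decoder, for example gzip.decompress(parallel_gzip(data)).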
Doing this, each thread compresses its chunk with no reference to any other uncompressed data, where such references would normally improve the compression when doing it serially. If you want to get more advanced, you can feed each thread the chunk of uncompressed data to compress, and the last 32K of the previous chunk to provide history for the compressor. You do this with deflateSetDictionary().
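A small Python sketch of the dictionary idea, assuming the standard zlib module, whose zdict argument to compressobj() corresponds to deflateSetDictionary(). The function name and chunk size are made up for the example, and the loop runs serially for clarity; each iteration only needs its own chunk plus the preceding 32K, so the iterations could be handed to separate threads.

```python
import zlib

WINDOW = 32 * 1024  # deflate can look back at most 32K

def deflate_with_history(data, chunk_size=128 * 1024):
    """Raw deflate of data, chunk by chunk, priming each compressor with prior data."""
    if not data:
        return zlib.compressobj(6, zlib.DEFLATED, -15).flush()
    parts = []
    for start in range(0, len(data), chunk_size):
        history = data[max(0, start - WINDOW):start]  # last 32K before this chunk
        if history:
            comp = zlib.compressobj(6, zlib.DEFLATED, -15, zdict=history)
        else:
            comp = zlib.compressobj(6, zlib.DEFLATED, -15)
        last = start + chunk_size >= len(data)
        parts.append(comp.compress(data[start:start + chunk_size]) +
                     comp.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH))
    return b"".join(parts)
```

No dictionary is needed on the reading side: a sequential inflater already holds the previous 32K of output in its window, which is exactly what each compressor was primed with, so zlib.decompress(blob, -15) works unchanged.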
Still more advanced, you can reduce the number of bytes inserted between compressed streams by sometimes using Z_PARTIAL_FLUSH's until getting to a byte boundary. See pigz for the details of that.
Even more advanced, but slower, you can append compressed streams at the bit level instead of the byte level. That would require shifting every byte of the compressed stream twice to build a new shifted stream. At least for seven out of every eight preceding compressed streams. This eliminates all of the extra bits inserted between compressed streams.
A zlib stream can be generated in exactly the same way, using adler32_combine() for the checksum.
Your question about zlib implies a confusion. The zip format does not use the zlib header and trailer. zip has its own structure, within which raw deflate streams are embedded. You can use the above approach for those raw deflate streams as well.
Sure: see pigz (http://zlib.net/pigz/), a parallel implementation of gzip for modern multi-processor, multi-core machines.
Predicting time or compression ratio for lossless compression of a file?

How would one be able to predict execution time and/or resulting compression ratio when compressing a file using a certain lossless compression algorithm? I am mainly concerned with local compression, since if you know the time and compression ratio for local compression, you can easily calculate the time for network compression based on currently available network throughput.
Let's say you have some information about the file such as size, redundancy, and type (we can say text to keep it simple). Maybe we have some statistical data from actual prior measurements. What else would be needed to predict execution time and/or compression ratio (even if only very roughly)?
For just local compression, the size of the file would have an effect, since actually reading and writing data to/from storage media (SD card, hard drive) would take up the dominant portion of total execution time.
The actual compression portion will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100 KB or so). For example, larger HTML/JavaScript files compress better since they have higher redundancy.
I guess there is also a problem of scheduling, but this could probably be ignored for rough estimation.
This is a question that has been in my head for quite some time. I have been wondering whether some low-overhead code (say, on the server) can predict how long it would take to compress a file before performing the actual compression.
Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.
This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.
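A minimal sketch of the sampling approach in Python, assuming DEFLATE via the standard zlib module as the compressor and working on an in-memory bytes object for brevity (for a real file you would seek to each sample instead). The function name, the number of pieces and the piece size are illustrative choices, not recommendations.

```python
import random
import zlib

def estimate_ratio(data, pieces=20, piece_size=64 * 1024, seed=0):
    """Estimate uncompressed/compressed size from randomly placed samples."""
    if not data:
        return 1.0
    if len(data) <= piece_size:
        # Small input: just compress it, no sampling needed.
        return len(data) / len(zlib.compress(data))
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    raw = packed = 0
    for _ in range(pieces):
        start = rng.randrange(0, len(data) - piece_size + 1)
        piece = data[start:start + piece_size]
        raw += len(piece)
        packed += len(zlib.compress(piece))  # each piece is compressed on its own
    return raw / packed
```

Because every sample is compressed without the history a full pass would have, the estimate tends to be pessimistic, which is why it works as a lower bound. Timing the same loop gives a rough throughput figure for predicting the compression time.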
It depends on the data, but with images you can take small samples. Downsampling would change the result. Here is an example: PHP - Compress Image to Meet File Size Limit.
The compression ratio is the uncompressed size divided by the compressed size, and the space saving is 1 minus the compressed size divided by the uncompressed size.
And the performance benchmarking can be done using V8 or Sunspider.
You can also use algorithms like DEFLATE or LZMA to compute the mechanism. PPM (Prediction by Partial Matching) can be used for predicting.

How to handle an array with size 1,000,000,000 in C++?

I need to handle 3D cube data. Its number of elements can be several billions. I understand I can't allocate that much memory on Windows. So I am thinking disk-based operations with in-process database. Is there any better way to do this? Maybe something in boost?
Update: I will eventually have to provide browsing functionality with plots.
Update2: The following article seemed to be a good solution using memory mapped file. I will try it and update again. http://www.codeproject.com/Articles/26275/Using-memory-mapped-files-to-conserve-physical-mem
The first and most basic step is to break the data down into chunks. The size of the chunk depends on your needs: it could be the smallest or largest chunk that can be drawn at once, or for which geometry can be built, or an optimal size for compression.
Once you're working with manageable chunks, the immediate memory problem is averted. Stream the chunks (load and unload/save) as needed.
During the load/save process, you may want to involve compression and/or a database of sorts. Even something simple like RLE and SQLite (single table with coordinates and data blob) can save a good bit of space. Better compression will allow you to work with larger chunk sizes.
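To make the RLE plus SQLite suggestion concrete, here is a small Python sketch using the standard sqlite3 module. The table layout (chunk coordinates plus a blob), the function names, and the (count, value) byte-pair RLE format are all assumptions for illustration.

```python
import sqlite3

def rle_encode(data):
    """Encode bytes as (run length, value) pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(blob):
    out = bytearray()
    for k in range(0, len(blob), 2):
        out += bytes([blob[k + 1]]) * blob[k]
    return bytes(out)

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS chunks ("
                 "cx INTEGER, cy INTEGER, cz INTEGER, data BLOB, "
                 "PRIMARY KEY (cx, cy, cz))")
    return conn

def save_chunk(conn, coords, raw):
    with conn:  # commits on success
        conn.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
                     (*coords, rle_encode(raw)))

def load_chunk(conn, coords):
    row = conn.execute("SELECT data FROM chunks WHERE cx=? AND cy=? AND cz=?",
                       coords).fetchone()
    return rle_decode(row[0]) if row else None
```

Only the chunks currently being viewed or edited need to be loaded and decoded, so memory use depends on the working set rather than on the size of the whole cube.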
Depending on usage, it may be possible to keep chunks compressed in-memory and only uncompress briefly on modification (or when they could be modified). If your data is read-only, loading them and uncompressing only when needed will be very helpful.
Splitting the data into chunks also has side-benefits, such as being an extremely simple form for octrees, allowing geometry generation (marching cubes and such) to run on isolated chunks of data (simplifies threading), and making the save/load process significantly simpler.
Can you perhaps store the data more efficiently (read "Programming Pearls" by Bentley)? Is it sparse data?
If not, memory mapped files (MMF) are your friend and allow you to map chunks of MMF into memory that you can access like any other memory.
Use CreateFileMapping and MapViewOfFile to map a chunk into your process.
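For comparison, the same idea in Python: the standard mmap module wraps the platform's file-mapping calls (CreateFileMapping/MapViewOfFile on Windows). This sketch maps only a small window around the requested element rather than the whole file. The function name and the little-endian 32-bit element layout are assumptions for the example.

```python
import mmap
import os
import struct

ITEM = 4  # bytes per element: little-endian 32-bit signed integers (assumed layout)

def read_int32(path, index):
    """Read element `index` of a huge on-disk array by mapping one small window."""
    gran = mmap.ALLOCATIONGRANULARITY      # mapping offsets must be a multiple of this
    byte_pos = index * ITEM
    base = (byte_pos // gran) * gran       # start of the window containing the element
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if index < 0 or byte_pos + ITEM > size:
            raise IndexError(index)
        length = min(gran, size - base)    # do not map past the end of the file
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=base) as view:
            off = byte_pos - base
            return struct.unpack("<i", view[off:off + ITEM])[0]
```

In practice you would keep a view open and move it only when an access falls outside the window, instead of remapping on every read.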
Try VirtualAlloc from <windows.h>.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx
It's quite useful for large arrays as long as they fit in your RAM; beyond that, 0xC0000022L's answer is probably the better solution.

What is the best compression algorithm that allows random reads/writes in a file? [closed]

What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithms would be out of the question.
And I know Huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle.
I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"
A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
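A toy Python sketch of such a fixed-size-code scheme, assuming the data consists of records that repeat often. The class name and the 16-bit code width are choices made for this illustration only. Every record is replaced by a 2-byte code, so record i always lives at byte offset 2*i, and both reads and in-place writes take constant time.

```python
import struct

class FixedCodeStore:
    """Each record is stored as a fixed-width 16-bit code into a shared dictionary."""

    def __init__(self, records=()):
        self.table = []           # code -> record
        self.code_of = {}         # record -> code
        self.codes = bytearray()  # the "compressed file": 2 bytes per record
        for rec in records:
            self.codes += struct.pack("<H", self._code(rec))

    def _code(self, rec):
        code = self.code_of.get(rec)
        if code is None:
            if len(self.table) >= 1 << 16:
                raise ValueError("dictionary full: more than 65536 distinct records")
            code = len(self.table)
            self.table.append(rec)
            self.code_of[rec] = code
        return code

    def read(self, i):
        (code,) = struct.unpack_from("<H", self.codes, 2 * i)
        return self.table[code]

    def write(self, i, rec):
        # No neighbouring code depends on this one, so an in-place update is safe.
        struct.pack_into("<H", self.codes, 2 * i, self._code(rec))
```

The price is a weaker compression ratio than context-based methods, plus the housekeeping of dropping dictionary entries that are no longer referenced.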
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regroup them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.
Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
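A minimal Python sketch of that overlay idea. The class name is invented, and a plain dict keyed by byte offset stands in for the sparse matrix. To stay short, a read that misses the overlay decompresses the whole base with zlib; a real implementation would go through a chunk index instead.

```python
import zlib

class PatchedCompressedFile:
    """Read-only compressed base plus a sparse overlay of bytes written later."""

    def __init__(self, original):
        self.size = len(original)
        self.packed = zlib.compress(original)
        self.patches = {}  # uncompressed offset -> byte value written after compression

    def write(self, pos, data):
        if pos < 0 or pos + len(data) > self.size:
            raise ValueError("write outside the file")
        for i, value in enumerate(data):
            self.patches[pos + i] = value

    def read(self, pos, n):
        end = min(pos + n, self.size)
        # Only touch the compressed data if some requested byte was never overwritten.
        need_base = any(p not in self.patches for p in range(pos, end))
        base = zlib.decompress(self.packed) if need_base else b""
        return bytes(self.patches[p] if p in self.patches else base[p]
                     for p in range(pos, end))
```

Once the overlay grows large, the file can be recompressed with the patches applied and the overlay emptied.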