What is the best compression algorithm that allows random reads/writes in a file? [closed]

What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithm would be out of the question.
And I know Huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. If you have suggestions on an easy way to do this, and on how to know the block boundaries, please let me know. If blocks are part of your solution, please also explain what you do when the data you want to read crosses a block boundary.
In the context of your answers, please assume the file in question is 100 GB, and that sometimes I'll want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes in the middle.

I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"

A dictionary-based compression scheme, with each dictionary entry's code encoded with the same size, lets you begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
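A minimal sketch of that idea, purely for illustration: it assumes fixed-length 4-byte sequences mapped to fixed-width 2-byte codes so the offset arithmetic stays trivial; the class name and layout are hypothetical, not from any particular library.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical fixed-width dictionary coder: every 4-byte source sequence is
// replaced by a 2-byte code, so the code covering uncompressed offset O is
// simply code number O / 4.  Trailing partial sequences and dictionaries
// larger than 65536 entries are not handled in this sketch.
struct FixedDictCoder {
    std::unordered_map<std::string, uint16_t> encode_map;   // sequence -> code
    std::vector<std::string> decode_map;                     // code -> sequence

    uint16_t code_for(const std::string& seq) {
        auto it = encode_map.find(seq);
        if (it != encode_map.end()) return it->second;
        uint16_t code = (uint16_t)decode_map.size();         // grow dictionary lazily
        encode_map.emplace(seq, code);
        decode_map.push_back(seq);
        return code;
    }

    std::vector<uint16_t> compress(const std::string& data) {
        std::vector<uint16_t> codes;
        for (size_t i = 0; i + 4 <= data.size(); i += 4)
            codes.push_back(code_for(data.substr(i, 4)));
        return codes;
    }

    // Random read: fetch len bytes starting at uncompressed offset off by
    // decoding only the codes that cover that range.
    std::string read(const std::vector<uint16_t>& codes, size_t off, size_t len) const {
        std::string out;
        for (size_t i = off / 4; i * 4 < off + len && i < codes.size(); ++i)
            out += decode_map[codes[i]];
        if (out.size() <= off % 4) return std::string();      // read past end of data
        return out.substr(off % 4, len);
    }
};

An in-place write of a 4-byte-aligned sequence then amounts to replacing the code at position off / 4.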

I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
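A rough sketch of the bookkeeping this scheme would need, with hypothetical names: the compression of each "file" is left opaque, and splitting or invalidating overlapping extents, defragmentation, and the dictionary itself are omitted.

#include <cstdint>
#include <map>
#include <vector>

// Logical byte ranges map to compressed blobs ("files"); a write appends a new
// blob and repoints the range, leaving the old blob to be reclaimed later by
// the defragmentation/compaction pass.
struct Extent { size_t blob_id; size_t offset_in_blob; size_t length; };

struct CompressedStore {
    std::vector<std::vector<uint8_t>> blobs;        // compressed "files" (opaque here)
    std::map<uint64_t, Extent> extents;             // logical start offset -> extent

    void write(uint64_t logical_off, std::vector<uint8_t> compressed_blob,
               size_t uncompressed_len) {
        size_t id = blobs.size();
        blobs.push_back(std::move(compressed_blob));
        extents[logical_off] = Extent{id, 0, uncompressed_len};
    }

    // Locate the extent covering a logical offset; the read path would then
    // decompress only that blob and slice out the requested bytes.
    const Extent* find(uint64_t logical_off) const {
        auto it = extents.upper_bound(logical_off);
        if (it == extents.begin()) return nullptr;
        --it;
        if (logical_off < it->first + it->second.length) return &it->second;
        return nullptr;
    }
};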

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
For example, let's look at the read-only case first. Say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
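A minimal sketch of that read path, assuming each 8K chunk was compressed independently with zlib's compress() and that an index of compressed offsets and sizes was recorded at write time; the names and layout here are illustrative only, and error handling is omitted.

#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

const size_t kChunk = 8 * 1024;                         // uncompressed chunk size
struct ChunkInfo { uint64_t offset; uint64_t size; };   // where each compressed chunk lives

// Read len bytes starting at uncompressed offset off, decompressing only the
// chunks that the range touches.  store holds all compressed chunks
// back-to-back; index was recorded at compression time.  Build with -lz.
std::string read_range(const std::vector<unsigned char>& store,
                       const std::vector<ChunkInfo>& index,
                       uint64_t off, size_t len) {
    std::string out;
    std::vector<unsigned char> plain(kChunk);
    for (uint64_t c = off / kChunk; c * kChunk < off + len; ++c) {
        const ChunkInfo& ci = index[c];
        uLongf plain_len = kChunk;
        uncompress(plain.data(), &plain_len,                 // each chunk was compressed
                   store.data() + ci.offset, (uLong)ci.size); // independently with compress()
        uint64_t chunk_start = c * kChunk;
        uint64_t from = (off > chunk_start) ? off - chunk_start : 0;
        uint64_t to = std::min<uint64_t>(plain_len, off + len - chunk_start);
        out.append(reinterpret_cast<const char*>(plain.data()) + from, to - from);
    }
    return out;
}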
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
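A rough sketch of that overlay idea, with hypothetical names: overlay entries are assumed not to overlap each other, and read_compressed() is a placeholder standing in for whatever block or bookmark scheme sits underneath.

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Post-compression writes are kept uncompressed in an overlay keyed by offset;
// reads take the baseline bytes from the compressed file and patch in any
// overlapping overwrites.
std::map<uint64_t, std::string> overlay;            // write offset -> overwritten bytes

// Placeholder: in real code this would locate and decompress the right chunk(s).
std::string read_compressed(uint64_t off, size_t len) {
    return std::string(len, '\0');
}

void write_bytes(uint64_t off, const std::string& bytes) {
    overlay[off] = bytes;                            // real code would merge/split overlaps
}

std::string read_bytes(uint64_t off, size_t len) {
    std::string out = read_compressed(off, len);     // baseline from the compressed file
    for (const auto& [woff, bytes] : overlay) {
        uint64_t wend = woff + bytes.size();
        if (wend <= off || woff >= off + len) continue;          // no overlap with request
        uint64_t start = std::max(woff, off);
        uint64_t end = std::min(wend, off + (uint64_t)len);
        out.replace(start - off, end - start,
                    bytes.substr(start - woff, end - start));
    }
    return out;
}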

Related

Compression Algorithms with Constant-Time Seek to Specific Byte?

I'm experimenting with building a data-structure optimized for a very specific use-case. Essentially, I am trying to build a compressed bitset of a constant size, and obviously for that use-case, two operations exist: read the value of a bit or write the value of a bit.
The best case scenario would be to be able to read a byte and write a byte in-place in constant time, but I can't imagine that it would be possible to write to an arbitrary byte without making changes to the rest of the compressed chunk of memory. However, it might be possible to read an arbitrary byte in an amount of time that tends toward O(1).
I have been reading Wikipedia articles, and I'm familiar with LZO, but is there a table somewhere which describes the various features and tradeoffs that various compression systems provide? I'd like a moderate level of compression, and I'm mainly wanting to optimize around memory holes, e.g. large gaps of bytes which are zeroes.
Assuming that you are doing many of these random accesses, you can build an index (once) to a compressed stream to get O(1). Here is an example for gzip streams.
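For the zero-heavy bitset described in the question, a structure that simply elides all-zero pages may already be enough, without a general-purpose compressor. A minimal sketch, with an illustrative 512-byte page size:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Bits are grouped into 512-byte pages and only pages containing at least one
// set bit are stored; a missing page is an all-zero "hole".  get() is one hash
// lookup on average, set() may allocate a page.
class SparseBitset {
    static const uint64_t kPageBits = 512 * 8;
    std::unordered_map<uint64_t, std::vector<uint8_t>> pages_;   // page index -> 512 bytes
public:
    bool get(uint64_t bit) const {
        auto it = pages_.find(bit / kPageBits);
        if (it == pages_.end()) return false;                    // inside a hole
        uint64_t in_page = bit % kPageBits;
        return (it->second[in_page / 8] >> (in_page % 8)) & 1;
    }
    void set(uint64_t bit, bool value) {
        uint64_t page = bit / kPageBits, in_page = bit % kPageBits;
        auto it = pages_.find(page);
        if (it == pages_.end()) {
            if (!value) return;                                  // clearing a hole: no-op
            it = pages_.emplace(page, std::vector<uint8_t>(512, 0)).first;
        }
        uint8_t mask = (uint8_t)(1u << (in_page % 8));
        if (value) it->second[in_page / 8] |= mask;
        else       it->second[in_page / 8] &= (uint8_t)~mask;
    }
};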

Efficiently reading and writing mixed data types in C++11 [closed]

Efficient way of writing and reading mixed data types (viz. unsigned integer, double, uint64_t, string) to a file in C++.
I need to write and read data containing mixed data types on disk. I used the following method to write the data. However, it is turning out to be very slow.
fstream myFile;
myFile.open("myFile", ios::out | ios::binary);
double x; //with appropriate initialization
myFile<<x;
int y;
myFile<<y;
uint64_t z;
myFile<<z;
string myString;
myFile<<myString;
However, this method is turning out to be very inefficient for large data of size 20 GB. Can someone please suggest how I can quickly read and write mixed data types in C++?
I think the first thing you need to determine is whether or not your program actually is slow.
What do I mean by that? Of course you think it is slow, but is it slow because your particular program is inefficient, or is it slow simply because writing 20 gigabytes of data to disk is an inherently time-consuming operation to perform?
So the first thing I would do is run some benchmark tests on your hard drive to determine its raw speed (in megabytes-per-second, or whatever). There are commercial apps that do this, or you could just use a built-in utility (like dd on Unix or Mac) to give you a rough idea of how long it takes your particular hard drive to read or write 20 gigabytes of dummy data:
dd if=/dev/zero of=junk.bin bs=1024 count=20971520
dd if=junk.bin of=/dev/zero bs=1024
If dd (or whatever) is able to transfer the data significantly faster than your program can, then there is room for your program to improve. On the other hand, if dd's speed isn't much faster than your program's speed, then there's nothing you can do other than go out and buy a faster hard drive (or maybe an SSD or a RAM drive or something).
Assuming the above test does indicate that your program is less efficient than it could be, the first thing I would try is replacing your C++ iostream calls with an equivalent implementation that uses the C fopen()/fread()/fwrite()/fclose() API calls instead. Some C++ iostream implementations are known to be somewhat inefficient, but it's unlikely that the (simpler) C I/O APIs are inefficient. If nothing else, comparing the performance of the C++ and C versions would let you either confirm or deny that your C++ library's iostreams implementation is a bottleneck.
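As a point of comparison, here is a minimal sketch of the C-API approach for the fields in the question, writing each value in raw binary form and length-prefixing the string; the record layout is just an illustration, not a required format.

#include <cstdint>
#include <cstdio>
#include <string>

// Each field is written in its raw binary representation with fwrite() rather
// than formatted as text; the string is length-prefixed so it can be read back
// unambiguously.  (Files written this way are not portable across machines
// with different endianness or type sizes.)
void write_record(std::FILE* f, double x, int y, uint64_t z, const std::string& s) {
    std::fwrite(&x, sizeof x, 1, f);
    std::fwrite(&y, sizeof y, 1, f);
    std::fwrite(&z, sizeof z, 1, f);
    uint32_t len = (uint32_t)s.size();
    std::fwrite(&len, sizeof len, 1, f);
    std::fwrite(s.data(), 1, len, f);
}

int main() {
    std::FILE* f = std::fopen("myFile.bin", "wb");
    if (!f) return 1;
    write_record(f, 3.14, 42, 123456789u, "some string");
    std::fclose(f);
    return 0;
}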
If even the C API doesn't get you the speed you need, the next thing I would look at is changing your file format to something that is easier to read or write; for example, assuming you have sufficient memory, you might just use mmap() to associate a large block of virtual address space with the contents of a file, and then just read/write the file contents as if they were RAM. (That might or might not make things faster, depending on how you access the data.)
If all else fails, the final thing to do is reduce the amount of data you need to read or write. Are there parts of the data that you can store separately so that you don't need to read and write them every time? Is there data that you can store more compactly (e.g. perhaps there are commonly used strings in your data that you could store as integer codes instead of strings)? What if you use zlib to compress the data before you write it, so that there is less data to write? The data you appear to be writing in your example looks like it might be amenable to compression, perhaps reducing your 20GB file to a 5GB file or so. Etc.

Decompress data on the fly from offsets in the original data?

I have a block of data I want to compress, say, C structures of variable sizes. I want to compress the data, but access specific fields of structures on the fly in application code without having to decompress the entire data.
Is there an algorithm which can take the offset (for the original data), decompress and return the data?
Compression methods generally achieve compression by making use of the preceding data. At any point in the compressed data, you need to know at least some amount of the preceding uncompressed data in order to decompress what follows.
You can deliberately forget the history at select points in the compressed data in order to have random access at those points. This reduces the compression by some amount, but that can be small with sufficiently distant random access points. A simple approach would be to compress pieces with gzip and concatenate the gzip streams, keeping a record of the offsets of each stream. For less overhead, you can use Z_FULL_FLUSH in zlib to do the same thing.
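A sketch of the Z_FULL_FLUSH approach, using a raw deflate stream (negative windowBits) so that inflation can be restarted at any recorded flush point; the block size, stand-in data, and structure names are illustrative, and error handling is omitted.

#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct AccessPoint { uint64_t uncompressed_off; uint64_t compressed_off; };

int main() {
    const size_t kBlock = 64 * 1024;                        // distance between access points
    std::vector<unsigned char> input(4 * kBlock, 'a');      // stand-in for real data
    std::vector<unsigned char> compressed;
    std::vector<AccessPoint> index;

    z_stream s{};
    // Raw deflate (windowBits = -15) so inflate can be restarted at a
    // full-flush boundary without needing a stream header.
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);
    std::vector<unsigned char> out(deflateBound(&s, kBlock) + 16);
    for (size_t off = 0; off < input.size(); off += kBlock) {
        index.push_back({off, compressed.size()});          // next block starts here
        size_t n = std::min(kBlock, input.size() - off);
        s.next_in = input.data() + off;
        s.avail_in = (uInt)n;
        s.next_out = out.data();
        s.avail_out = (uInt)out.size();
        deflate(&s, Z_FULL_FLUSH);                          // resets history, byte-aligns output
        compressed.insert(compressed.end(), out.data(), s.next_out);
    }
    deflateEnd(&s);

    // Random read: decompress only the block containing uncompressed offset O.
    uint64_t O = 3 * kBlock + 17;
    const AccessPoint& ap = index[O / kBlock];
    std::vector<unsigned char> plain(kBlock);
    z_stream d{};
    inflateInit2(&d, -15);                                  // raw inflate, matching -15 above
    d.next_in = compressed.data() + ap.compressed_off;
    d.avail_in = (uInt)(compressed.size() - ap.compressed_off);
    d.next_out = plain.data();
    d.avail_out = (uInt)plain.size();
    inflate(&d, Z_SYNC_FLUSH);
    inflateEnd(&d);
    std::printf("byte at offset %llu: %c\n",
                (unsigned long long)O, plain[O - ap.uncompressed_off]);
    return 0;                                               // build with: -lz
}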
Alternatively, you can save the history at each random access point in a separate file. An example of building such a random access index to a zlib or gzip stream can be found in zran.c.
You can construct compression methods that do not depend on previous history for decompression, such as simple Huffman coding. However the compression ratio will be poor compared to methods that do depend on previous history.
Example: a compressed file system. There, the filesystem API doesn't need to know about the compression that happens before the data is written to disk. There are a few algorithms out there.
Check here for more details.
However, maybe there is more to gain in trying to optimize the data structures used, so there is no need to compress them at all?
For efficient access an index is needed. So between arrays, MultiMaps, and sparse arrays, there should be a way to model the data so that no further compression is needed because the data is already represented efficiently.
Of course this depends largely on the use case which is quite ambiguous.
It is possible to imagine a use case where a compression layer is needed to access the data, but it's likely that there are better ways to solve the problem.

What is the most efficient way to read formatted data from a large file?

Options:
1. Reading the whole file into one huge buffer and parsing it afterwards.
2. Mapping the file to virtual memory.
3. Reading the file in chunks and parsing them one by one.
The file can contain quite arbitrary data but it's mostly numbers, values, strings and so on formatted in certain ways (commas, brackets, quotations, etc).
Which option would give me greatest overall performance?
If the file is very large, then you might consider using multiple threads with option 2 or 3. Each thread can handle a single chunk of file/memory and you can overlap IO and computation (parsing) this way.
It's hard to give a general answer to your question as choosing the "right" strategy heavily depends on the organization of the data you are reading.
Especially if there's a really huge amount of data to be processed, options 1 and 2 won't work anyway, as the available amount of main memory poses an upper limit on any such attempt.
Most probably the biggest gain in terms of efficiency can be achieved by (re)structuring the data you are going to process.
Checking whether there is any chance to organize the data so that whole chunks don't have to be processed needlessly would be the first thing I'd try before addressing the problem mentioned in the question.
In terms of efficiency there's nothing but a constant factor to win by choosing any of the mentioned methods, while the right organization of your data might yield much bigger improvements. The bigger the data, the more important that decision becomes.
Some facts about the data that seem interesting enough to take into consideration include:
Is there any regular pattern to the data you are going to process ?
Is the data mostly static or highly dynamic?
Does it have to be parsed sequentially or is it possible to process data in parallel?
It makes no sense to read the entire file all at once and then convert from text to binary data; it's more convenient to write, but you run out of memory faster. I would read the text in chunks and convert as you go. The converted data, in binary format instead of text, will likely take up less space than the original source text anyway.
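A minimal sketch of reading and parsing in chunks for a line-oriented text format, carrying the partial record at the end of each chunk over to the next read so values split across chunk boundaries are not lost; parse_line() is a hypothetical stand-in for the format-specific parsing.

#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the format-specific parsing of one record.
void parse_line(const std::string& line) {
    // ... convert the text of one record to binary values here ...
}

// Read the file in fixed-size chunks; any partial line left at the end of a
// chunk is carried over and completed by the next read.
void parse_in_chunks(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);                         // 1 MiB per read
    std::string carry;                                      // incomplete tail of last chunk
    while (in.read(buf.data(), (std::streamsize)buf.size()) || in.gcount() > 0) {
        carry.append(buf.data(), (size_t)in.gcount());
        size_t pos = 0, nl;
        while ((nl = carry.find('\n', pos)) != std::string::npos) {
            parse_line(carry.substr(pos, nl - pos));
            pos = nl + 1;
        }
        carry.erase(0, pos);                                // keep the unfinished record
    }
    if (!carry.empty()) parse_line(carry);                  // last record without newline
}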

Open-Source compression algorithm with Checkpoints [closed]

I'm working in C++ with gcc 4.5.0, and msvc8/9.
I'd like to be able to compress a file (10 Gb), then open this file using my application.
However, the file contents are such that I don't necessarily need everything inside of them every time I use them.
So, for example, one time I open one of these compressed files, and decide that I want to seek 95% of the way through the file without loading it. With compression algorithms like gzip, this is not possible: I must decompress the first 95% of the file, before I can decompress the last 5%.
So, are there any libraries similar to gzip, that are open source and available for commercial use, that have built-in checkpoints to re-sync the decompression stream?
I have thought that perhaps a lossless audio codec might do the trick. I know that some of these algorithms have checkpoints so you can seek through a music file and not have to wait while the full contents of the music file are decompressed. Are there pitfalls with using an audio codec for data de/compression?
Thanks!
bzip2 is free and open-source, and has readily available library implementations. It's block based, so you can decompress only the parts you need. If you need to seek to a particular location in the decompressed file, though, you might need to build a simple index over all the bzip2 blocks, to allow you to determine which one contains the address you need.
gzip, although stream based, can be reset on arbitrary block boundaries. The concatenation of any number of gzip streams is itself a valid gzip stream, so you could easily operate gzip in a block-compression mode without breaking compatibility with existing decompressors, too.
I'm not sure about open-source, but there have been/are a fair number of programs that do this. If the input is static, it's pretty trivial -- pick a fixed block size, and re-start the compressor after compressing that much input data.
If the content is dynamic, things get a bit uglier, because when you change the contents of a block of input, that will typically change its size. There are at least two ways to deal with this: one is to insert a small amount of padding between blocks, so small changes can be accommodated in-place (e.g., what started as a 64K block of input gets rounded to an integral number of 512-byte compressed blocks). The second is to create an index to map from compressed blocks to de-compressed blocks. I'm pretty sure a practical solution will normally use both.
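A sketch of the padding idea with zlib, where each block's compressed bytes occupy an integral number of 512-byte units and an updated block is rewritten in place only while it still fits its original allocation; the names and sizes are illustrative, and relocation plus error handling are omitted.

#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Each block's compressed bytes occupy an integral number of 512-byte units in
// the file image; slot.allocated is what was reserved when the file was built.
struct Slot { uint64_t file_offset; uint32_t allocated; uint32_t used; };

// Re-compress a modified (e.g. 64K) input block and write it back in place if
// it still fits in its allocation; otherwise the caller must relocate it.
bool update_block(const std::vector<unsigned char>& block,
                  std::vector<unsigned char>& file_image, Slot& slot) {
    uLongf out_len = compressBound((uLong)block.size());
    std::vector<unsigned char> out(out_len);
    compress(out.data(), &out_len, block.data(), (uLong)block.size());
    uint32_t needed = ((uint32_t)out_len + 511) / 512 * 512;   // round up to 512-byte units
    if (needed > slot.allocated) return false;                  // no longer fits: relocate
    std::copy(out.begin(), out.begin() + out_len, file_image.begin() + slot.file_offset);
    slot.used = (uint32_t)out_len;
    return true;
}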
A simple approach would be to slice your uncompressed content into "blocks" and compress each one independently. They won't compress as well overall (as you won't be "sharing" between the blocks), but you can decompress blocks independently.
"Key frames" in compressed video is sort of a special-case of this general approach.
http://sourceforge.net/projects/gzx