How many times can a file be compressed? - compression

I was thinking about compression, and it seems like there would have to be some sort of limit to the compression that could be applied to it, otherwise it'd be a single byte.
So my question is, how many times can I compress a file before:
It does not get any smaller?
The file becomes corrupt?
Are these two points the same or different?
Where does the point of diminishing returns appear?
How can these points be found?
I'm not talking about any specific algorithm or particular file, just in general.

For lossless compression, the only way you can know how many times you can gain by recompressing a file is by trying. It's going to depend on the compression algorithm and the file you're compressing.
Two files can never compress to the same output, so you can't go down to one byte. How could one byte represent all the files you could decompress to?
The reason that the second compression sometimes works is that a compression algorithm can't do omniscient perfect compression. There's a trade-off between the work it has to do and the time it takes to do it. Your file is being changed from all data to a combination of data about your data and the data itself.
Example
Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52 10 bytes
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)
Where the positive number in brackets is a repeat count and the negative number in brackets is a command to emit the next -n characters as they are found.
In this case we could try one more compression:
[3] 04 [-4] 43 fe 51 52 7 bytes (fe is your -2 seen as two's complement data)
We gained nothing, and we'll start growing on the next iteration:
[-7] 03 04 fc 43 fe 51 52 8 bytes
We'll grow by one byte per iteration for a while, but it will actually get worse. One byte can only hold negative numbers to -128. We'll start growing by two bytes when the file surpasses 128 bytes in length. The growth will get still worse as the file gets bigger.
There's a headwind blowing against the compression program--the meta data. And also, for real compressors, the header tacked on to the beginning of the file. That means that eventually the file will start growing with each additional compression.
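The run-length scheme walked through above can be tried out with a small Python sketch. The function names `rle_encode` and `rle_decode`, the minimum run length of 3, and the caps of 127 repeats and 128 literals per count byte are assumptions made for illustration; the answer only fixes the meaning of positive and negative counts.

```python
def rle_encode(data: bytes) -> bytes:
    """Signed-count RLE: count n > 0 repeats the next byte n times,
    count -n (two's complement) copies the next n bytes verbatim."""
    out = bytearray()
    literals = bytearray()

    def flush_literals():
        # Emit pending literal bytes in chunks of at most 128, each prefixed by -n.
        for start in range(0, len(literals), 128):
            chunk = literals[start:start + 128]
            out.append(256 - len(chunk))  # -n stored as one two's-complement byte
            out.extend(chunk)
        literals.clear()

    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run >= 3:  # assumed threshold: shorter runs are cheaper as literals
            flush_literals()
            out.append(run)
            out.append(data[i])
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        count = data[i]
        i += 1
        if count < 128:  # positive count: repeat the next byte
            out.extend(data[i:i + 1] * count)
            i += 1
        else:  # negative count: copy the next -count bytes
            n = 256 - count
            out.extend(data[i:i + n])
            i += n
    return bytes(out)


data = bytes.fromhex("04040404434343435152")
for _ in range(4):
    data = rle_encode(data)
    print(len(data), data.hex())  # sizes: 7, 7, 8, 9
```

Each pass is still perfectly reversible with `rle_decode`; the output simply stops shrinking once the only thing left to encode is the previous pass's metadata.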
RLE is a starting point. If you want to learn more, look at LZ77 (which looks back into the file to find patterns) and LZ78 (which builds a dictionary). Compressors like zip often try multiple algorithms and use the best one.
Here are some cases I can think of where multiple compression has worked.
I worked at an Amiga magazine that shipped with a disk. Naturally, we packed the disk to the gills. One of the tools we used let you pack an executable so that when it was run, it decompressed and ran itself. Because the decompression algorithm had to be in every executable, it had to be small and simple. We often got extra gains by compressing twice. The decompression was done in RAM. Since reading a floppy was slow, we often got a speed increase as well!
Microsoft supported RLE compression on bmp files. Also, many word processors did RLE encoding. RLE files are almost always significantly compressible by a better compressor.
A lot of the games I worked on used a small, fast LZ77 decompressor. If you compress a large rectangle of pixels (especially if it has a lot of background color, or if it's an animation), you can very often compress twice with good results. (The reason? You only have so many bits to specify the lookback distance and the length, so a single large repeated pattern is encoded in several pieces, and those pieces are highly compressible.)

Generally the limit is one compression. Some algorithms result in a higher compression ratio, and using a poor algorithm followed by a good algorithm will often result in improvements. But using the good algorithm in the first place is the proper thing to do.
There is a theoretical limit to how much a given set of data can be compressed. To learn more about this you will have to study information theory.

In general for most algorithms, compressing more than once isn't useful. There's a special case though.
If you have a large number of duplicate files, the zip format will zip each independently, and you can then zip the first zip file to remove duplicate zip information. Specifically, for 7 identical Excel files sized at 108kb, zipping them with 7-zip results in a 120kb archive. Zipping again results in an 18kb archive. Going past that you get diminishing returns.

Suppose we have a file N bits long, and we want to compress it losslessly, so that we can recover the original file. There are 2^N possible files N bits long, and so our compression algorithm has to change one of these files to one of 2^N possible others. However, we can't express 2^N different files in less than N bits.
Therefore, if we can take some files and compress them, we have to have some files that lengthen under compression, to balance out the ones that shorten.
This means that a compression algorithm can only compress certain files, and it actually has to lengthen some. This means that, on the average, compressing a random file can't shorten it, but might lengthen it.
Practical compression algorithms work because we don't usually use random files. Most of the files we use have some sort of structure or other properties, whether they're text or program executables or meaningful images. By using a good compression algorithm, we can dramatically shorten files of the types we normally use.
However, the compressed file is not one of those types. If the compression algorithm is good, most of the structure and redundancy have been squeezed out, and what's left looks pretty much like randomness.
No compression algorithm, as we've seen, can effectively compress a random file, and that applies to a random-looking file also. Therefore, trying to re-compress a compressed file won't shorten it significantly, and might well lengthen it some.
So, the normal number of times a compression algorithm can be profitably run is one.
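This is easy to check empirically. The sketch below uses Python's standard `zlib` module as a stand-in for "a good compression algorithm" and records the size after each pass; the helper name `sizes_after_passes` and the two synthetic inputs are made up for illustration, and the exact numbers will vary with the zlib version.

```python
import random
import zlib


def sizes_after_passes(data: bytes, passes: int = 4) -> list:
    """Compress repeatedly and return the size before and after each pass."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


rng = random.Random(0)  # fixed seed so the run is repeatable

# Structured input: a stream built from only four distinct words.
structured = b"".join(
    rng.choice([b"alpha ", b"beta ", b"gamma ", b"delta "]) for _ in range(5000)
)
# Random-looking input: nothing for the compressor to exploit.
random_like = bytes(rng.getrandbits(8) for _ in range(20000))

print("structured:", sizes_after_passes(structured))
print("random:    ", sizes_after_passes(random_like))
```

The structured input shrinks a lot on the first pass, while the random-looking input grows slightly right away because of the format's fixed overhead.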
Corruption only happens when we're talking about lossy compression. For example, you can't necessarily recover an image precisely from a JPEG file. This means that a JPEG compressor can reliably shorten an image file, but only at the cost of not being able to recover it exactly. We're often willing to do this for images, but not for text, and particularly not executable files.
In this case, there is no stage at which corruption begins. It starts when you begin to compress it, and gets worse as you compress it more. That's why good image-processing programs let you specify how much compression you want when you make a JPEG: so you can balance quality of image against file size. You find the stopping point by considering the cost of file size (which is more important for net connections than storage, in general) versus the cost of reduced quality. There's no obvious right answer.

Usually compressing once is good enough if the algorithm is good.
In fact, compressing multiple times could lead to an increase in size.
Your two points are different.
Compression done repeatedly and achieving no improvement in size reduction
is an expected theoretical condition
Repeated compression causing corruption
is likely to be an error in the implementation (or maybe the algorithm itself)
Now let's look at some exceptions or variations:
Encryption may be applied repeatedly without reduction in size
(in fact, at times with an increase in size) for the purpose of increased security
Image, video or audio files that are increasingly compressed
will lose data (effectively be 'corrupted' in a sense)

You can compress a file as many times as you like. But for most compression algorithms the resulting compression from the second time on will be negligible.

Compression (I'm thinking lossless) basically means expressing something more concisely. For example
111111111111111
could be more concisely expressed as
15 X '1'
This is called run-length encoding. Another method that a computer can use is to find a pattern that is regularly repeated in a file.
There is clearly a limit to how much these techniques can be used. For example, run-length encoding is not going to be effective on
15 X '1'
since there are no repeating patterns. Similarly, if the pattern replacement method converts long patterns to 3-character ones, reapplying it will have little effect, because the only remaining repeating patterns will be of length 3 or shorter. Generally, applying compression to an already compressed file makes it slightly bigger, because of various overheads. Applying good compression to a poorly compressed file is usually less effective than applying just the good compression.

How many times can I compress a file before it does not get any smaller?
In general, not even one. Whatever compression algorithm you use, there must always exist a file that does not get compressed at all; otherwise you could always compress repeatedly until you reach 1 byte, by your own argument.
How many times can I compress a file before it becomes corrupt?
If the program you use to compress the file does its job, the file will never become corrupt (of course, I am thinking of lossless compression).

You can compress infinite times. However, the second and further compressions usually will only produce a file larger than the previous one. So there is no point in compressing more than once.

It is a very good question. You can look at a file from different points of view. Maybe you know a priori that the file contains an arithmetic series.
Let's view it as a data stream of "bytes", "symbols", or "samples".
Some of the answers come from information theory and mathematical statistics.
For a full, in-depth understanding, please check the monographs of these researchers:
A. Kolmogorov
S. Kullback
C. Shannon
N. Wiener
One of the main concepts in information theory is entropy.
If you have a stream of "bytes", the entropy of those bytes doesn't depend on the values of your "bytes" or "samples".
It is defined only by the frequencies with which the bytes take different values.
Entropy is at its maximum for a fully random data stream.
Entropy is at its minimum, which is zero, when all your "bytes" have the same value.
It does not get any smaller?
So the entropy is the minimum number of bits per "byte" that you need to use when writing the information to disk. Of course, that holds only if you use God's algorithm. Real-life lossless compression algorithms are heuristic and do not reach it.
The file becomes corrupt?
I don't understand the sense of the question. You can write no bits to the disk, and you will have written a corrupted file with a size of 0 bits. Of course it is corrupted, but its size is zero bits.
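The frequency-based entropy described in this answer is short to compute. The following is a minimal sketch; the function name `entropy_bits_per_byte` is made up, and it treats every byte as independent of its neighbours, so a compressor that exploits longer patterns can do better than this figure suggests.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value frequencies, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    # Sum of p * log2(1/p) over every byte value that occurs, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


print(entropy_bits_per_byte(b"\x07" * 100))      # 0.0: every byte identical
print(entropy_bits_per_byte(b"ab" * 10))         # 1.0: two equally likely values
print(entropy_bits_per_byte(bytes(range(256))))  # 8.0: all 256 values equally likely
```

Multiplying the result by the number of bytes and dividing by 8 gives a rough lower bound, in bytes, for any coder that only looks at individual byte frequencies.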

Here is the ultimate compression algorithm (in Python) which by repeated use will compress any string of digits down to size 0 (it's left as an exercise to the reader how to apply this to a string of bytes).
    def compress(digitString):
        if digitString == "":
            raise ValueError("already as small as possible")
        currentLen = len(digitString)
        if digitString == "0" * currentLen:
            return "9" * (currentLen - 1)
        n = str(int(digitString) - 1)  # convert to number and decrement
        newLen = len(n)
        return ("0" * (currentLen - newLen)) + n  # add zeros to keep same length

    # test it
    x = "12"
    while not x == "":
        print(x)
        x = compress(x)
The program outputs 12 11 10 09 08 07 06 05 04 03 02 01 00 9 8 7 6 5 4 3 2 1 0 then empty string. It doesn't compress the string at each pass but it will with enough passes compress any digit string down to a zero length string. Make sure you write down how many times you send it through the compressor otherwise you won't be able to get it back.

I would like to suggest that compression has not really been pushed to its fullest limit. Every pixel or written character is ultimately an outline in black or white. One could write a program that decompresses flawlessly into what it was, say a book, but that compresses the pixel patterns and words with a better system. It would probably take a lot longer to compress, but as a file grows to gigabytes or terabytes, the repeated letters such as P, R and q and the black-and-white deviations could be compressed exponentially into a complex automated formula. The machine doesn't need the data to make sense; it can simply make a game of producing a highly compressed pattern. This in turn allows us humans to create a customized decompression engine, and that is where the real compression power would come from: design an entire engine that can restore the information on the user's side. The engine has its own optimal language, with no spaces, just filling black and white pixel boxes of the smallest size, or even writing its own pattern language. For optimal performance it can, at the same time, give out a unique cipher or decompression formula when it is done, so the file is optimally compressed and has a unique password that lets the engine decompress it later. The machine can do an almost limitless number of iterations to compress the file further. It's like having an open book and putting all the written stories of humanity onto one A4 sheet. I don't know, but it is another theory. What happens then is a split in volume, because the formula to decompress has its own size, and even the name of the folder or the icon information has a size, so one could go further and put every form of data into a single string of information. Hmm.

It all depends on the algorithm. In other words the question can be how many times a file can be compressed using this algorithm first, then this one next...

Example of a more advanced compression technique using "a double table, or cross matrix"
Also eliminates extraneous, unnecessary symbols in the algorithm
[PREVIOUS EXAMPLE]
Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52 10 bytes
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)
[TURNS INTO]
04.43.51.52 VALUES
4.4.**-2 COMPRESSION
Further Compression Using Additional Symbols as substitute values
04.A.B.C VALUES
4.4.**-2 COMPRESSION

In theory, we will never know, it is a never-ending thing:
In computer science and mathematics, the term full employment theorem
has been used to refer to a theorem showing that no algorithm can
optimally perform a particular task done by some class of
professionals. The name arises because such a theorem ensures that
there is endless scope to keep discovering new techniques to improve
the way at least some specific task is done. For example, the full
employment theorem for compiler writers states that there is no such
thing as a provably perfect size-optimizing compiler, as such a proof
for the compiler would have to detect non-terminating computations and
reduce them to a one-instruction infinite loop. Thus, the existence of
a provably perfect size-optimizing compiler would imply a solution to
the halting problem, which cannot exist, making the proof itself an
undecidable problem.
(source)

Related

Compression Algorithms with Constant-Time Seek to Specific Byte?

I'm experimenting with building a data-structure optimized for a very specific use-case. Essentially, I am trying to build a compressed bitset of a constant size, and obviously for that use-case, two operations exist: read the value of a bit or write the value of a bit.
The best case scenario would be to be able to read a byte and write a byte in-place in constant time, but I can't imagine that it would be possible to write to an arbitrary byte without making changes to the rest of the compressed chunk of memory. However, it might be possible to read an arbitrary byte in an amount of time that tends toward O(1).
I have been reading Wikipedia articles, and I'm familiar with LZO, but is there a table somewhere which describes the various features and tradeoffs that various compression systems provide? I'd like a moderate level of compression, and I'm mainly wanting to optimize around memory holes, e.g. large gaps of bytes which are zeroes.
Assuming that you are doing many of these random accesses, you can build an index (once) to a compressed stream to get O(1). Here is an example for gzip streams.
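A simpler relative of that idea, sketched here in Python, is to compress fixed-size blocks independently so that the block number itself acts as the index. This is not the gzip stream index mentioned above: the class name `BlockCompressed`, the 4096-byte block size, and the use of `zlib` are all illustrative assumptions. A read costs one block decompression regardless of the total size, and a write recompresses only the affected block.

```python
import zlib

BLOCK_SIZE = 4096  # uncompressed bytes per block (an arbitrary choice)


class BlockCompressed:
    """Fixed-size byte array stored as independently compressed blocks."""

    def __init__(self, data: bytes):
        self.length = len(data)
        self.blocks = [
            zlib.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)
        ]

    def read_byte(self, offset: int) -> int:
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        block = zlib.decompress(self.blocks[offset // BLOCK_SIZE])
        return block[offset % BLOCK_SIZE]

    def read_bit(self, bit_index: int) -> int:
        return (self.read_byte(bit_index // 8) >> (bit_index % 8)) & 1

    def write_byte(self, offset: int, value: int) -> None:
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        idx = offset // BLOCK_SIZE
        block = bytearray(zlib.decompress(self.blocks[idx]))
        block[offset % BLOCK_SIZE] = value
        self.blocks[idx] = zlib.compress(bytes(block))  # only this block changes
```

Long runs of zero bytes compress to a few dozen bytes per block, which suits the "memory holes" case; smaller blocks make each access cheaper but compress less well.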

Ways to assist the compression of large custom data files

I'm seeking advice on how to help compression tools achieve better lossless compression.
I have many large files (>100meg) containing sensor readings from a variety of sensors. The samples from various sensors are of different bit sizes (16 bit, 24 bit, 32 bit) and different frequencies (70Hz to 250Hz). With the common compressors I'm aware of (zip, gzip, bzip2) I can get a compressed file about 70% of the original file size. It seems to me if I could tell the compression tool these bytes are this type of sample and those bytes are another sample type there may be compression gains to be had but I'm not aware of anything that would let me do this.
Step 0 would be to code the data in binary. (16 bits in two bytes, 24 bits in three bytes, etc.) I hope that you're already doing that.
Step 1 would be to use differences. From your description, I bet that successive values don't change much. Therefore differences will be small and have many leading zero bits. Try that, and then a general-purpose compressor.
Step 2 would be to use variable-length integer coding. The high bit of each byte determines the span of each integer. The first byte of an integer always has a high bit of zero. All subsequent bytes of the same integer have a high bit of one. Build the integer out of the low seven bits of each byte. (I take the first byte to have the least significant bits, but you could do it most-significant bit order as well.) This will code your small differences in one byte. Also this coding will handle any number of bits in the samples, which is convenient in your application. Try this, and then a general-purpose compressor.
Step 3 might be more detailed analysis of the waveforms for a better predictor. Step 1 simply uses the last value as the predictor. You could have a more complex function of the previous n values as the predictor for the next value. Whether this would help is highly dependent on your data.
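Steps 1 and 2 can be sketched together in Python. The byte layout follows the description in Step 2 (first byte of each integer has its high bit clear, later bytes have it set, least significant bits first). One thing the steps leave open is how to store negative differences; the zigzag mapping used below is an assumption, as are all the function names.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z: int) -> int:
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def encode_varint(value: int) -> bytes:
    out = bytearray([value & 0x7F])  # first byte: high bit 0
    value >>= 7
    while value:
        out.append(0x80 | (value & 0x7F))  # continuation bytes: high bit 1
        value >>= 7
    return bytes(out)


def decode_varints(data: bytes) -> list:
    values, current, shift = [], None, 0
    for b in data:
        if b < 0x80:  # high bit clear: a new integer starts here
            if current is not None:
                values.append(current)
            current, shift = b, 7
        else:
            current |= (b & 0x7F) << shift
            shift += 7
    if current is not None:
        values.append(current)
    return values


def encode_samples(samples) -> bytes:
    out, prev = bytearray(), 0
    for s in samples:
        out += encode_varint(zigzag(s - prev))  # Step 1: store the difference
        prev = s
    return bytes(out)


def decode_samples(data: bytes) -> list:
    samples, prev = [], 0
    for z in decode_varints(data):
        prev += unzigzag(z)
        samples.append(prev)
    return samples
```

For example, the six readings `[1000, 1002, 1001, 1001, 995, 1300]` encode to 8 bytes instead of 12 as raw 16-bit values, and the result can still be handed to a general-purpose compressor such as `zlib.compress`. Each sensor channel would be encoded as its own stream so that differences are taken between like samples.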

Optimizing IO in C++

I'm having trouble optimizing a C++ program for the fastest runtime possible.
The code is required to output the absolute value of the difference of 2 long integers, fed through a file into the program, i.e.:
./myprogram < unkownfilenamefullofdata
The file name is unknown, and the file has 2 numbers per line, separated by a space. There is an unknown amount of test data. I created 2 files of test data. One has the extreme cases and is 5 runs long. As for the other, I used a Java program to generate 2,000,000 random numbers and wrote them to a timedrun file: 18 MB worth of tests.
The massive file runs at 3.4 seconds. I need to break that down to 1.1 seconds.
This is my code:
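    #include <cstdio>

    int main() {
        long int a, b;
        // scanf returns the number of items read; stop unless both were parsed
        while (scanf("%li %li", &a, &b) == 2) {
            if (b >= a)
                printf("%li\n", (b - a));
            else
                printf("%li\n", (a - b));
        } //endwhile
        return 0;
    }//end main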
I ran Valgrind on my program, and it showed that a lot of hold-up was in the read and write portion. How would I rewrite print/scan to the most raw form of C++ if I know that I'm only going to be receiving a number? Is there a way that I can scan the number in as a binary number, and manipulate the data with logical operations to compute the difference? I was also told to consider writing a buffer, but after ~6 hours of searching the web, and attempting the code, I was unsuccessful.
Any help would be greatly appreciated.
What you need to do is load the whole string into memory, and then extract the numbers from there, rather than making repeated I/O calls. However, what you may well find is that it simply takes a lot of time to load 18MB off the hard drive.
You can improve greatly on scanf because you can guarantee the format of your file. Since you know exactly what the format is, you don't need as many error checks. Also, printf does a conversion on the new line to the appropriate line break for your platform.
I have used code similar to that found in this SPOJ forum post (see nosy's post half-way down the page) to obtain quite large speed-ups in the reading integers area. You will need to modify it to deal with negative numbers. Hopefully it will give you some ideas about how to write a faster printf function as well, but I would start with replacing scanf and see how far that gets you.
As you suggest the problem is reading all these numbers in and converting from text to binary.
The best improvement would be to write the numbers out as binary from whatever program generates them. This will significantly reduce the amount of data that has to be read from the disk, and slightly reduce the time needed to convert from text to binary.
You say that 2,000,000 numbers occupy 18MB = 9 bytes per number. This includes the spaces and the end of line markers, so sounds reasonable.
Storing the numbers as 4-byte integers will halve the amount of data that must be read from the disk. Along with the saving on format conversion, it would be reasonable to expect a doubling of performance.
Since you need even more, something more radical is required. You should consider splitting up the data file into separate files, each on its own disk, and then processing each file in its own process. If you have 4 cores, split the processing up into 4 separate processes and can connect 4 high-performance disks, then you might hope for another doubling of the performance. The bottleneck is now the OS disk management, and it is impossible to guess how well the OS will manage the four disks in parallel.
I assume that this is a grossly simplified model of the processing you need to do. If your description is all there is to it, the real solution would be to do the subtraction in the program that writes the test files!
Even better than opening the file in your program and reading it all at once, would be memory-mapping it. ~18MB is no problem for the ~2GB address space available to your program.
Then use strtol (the values are integers, not doubles) to read each number and advance the pointer.
I'd expect a 5-10x speedup compared to input redirection and scanf.
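The common thread in these answers (read the whole input once, parse it in memory, write the output once) is sketched below in Python rather than C++, purely to keep the example short; absolute timings will of course differ from a C++ version. The function name `abs_diffs` is made up, and the input is assumed to be well formed.

```python
def abs_diffs(data: bytes) -> bytes:
    """Parse whitespace-separated integer pairs from one in-memory buffer
    and return all results as a single output buffer."""
    nums = data.split()  # one pass over the buffer, no per-line I/O calls
    out = [str(abs(int(a) - int(b))) for a, b in zip(nums[0::2], nums[1::2])]
    return ("\n".join(out) + "\n").encode() if out else b""


# Typical use: one bulk read and one bulk write.
#   import sys
#   sys.stdout.buffer.write(abs_diffs(sys.stdin.buffer.read()))
print(abs_diffs(b"3 10\n-5 7\n"))  # b'7\n12\n'
```

The C++ equivalent would read the buffer with a single `fread` (or map it), walk it with a pointer, and append results to one output buffer that is flushed at the end.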

Matrix compression methods

In an application I've been working on, I have to send a 256 x 256 matrix over a socket. I'm developing a visualization client for an offshore system simulator that runs on a cluster, and this matrix is a heightmap representing the current state of the oceanic surface.
This is a real-time application, so speed is a must. Using a 256 x 256 matrix of floats, I have to send 256 kbytes of data every second, for a bandwidth requirement of 256 kbytes/second.
That's a lot, at least for my application.
So, my question is, is there some good method for compressing this matrix before sending it via socket? And, if there is such a method, how much of a reduction can I expect?
As my matrix represents a continuous surface, lossy compression methods are not a problem for me. I'm mostly concerned with the compression ratio, the time that it takes for the compression to take place and, finally, whether there is already an implementation of this method for C++.
If you are far enough offshore and/or in calm sea states, breaking waves are not likely to be a big problem. If this is the case, then the surface will be nicely continuous, and will likely look a lot like the superposition of multiple sine/cosine waves in X and Y.
2-D FFTs of the surface might give you some insight. You might be able to represent the surface as a bandwidth-limited 2-D FFT, and discard data for higher spatial frequencies.
First off: Numeric representation
I assume the physical range of the ocean height is limited (say, waves from -50.0 to 50 metres), if I understand your description correctly. The typical IEEE 754-2008 32-bit floating-point value (i.e. float in C/C++) uses 8 bits for its exponent (range of -126 to 127), 23 bits for the fraction and one bit for the sign. Note that this is base 2.
If your minimal measured (or computed) variation is 1 mm (0.001 metres), then you can reduce the floating-point size to as little as 16 bits. IEEE 754 defines a 16-bit floating-point value for use as an interchange format, with 5 bits for the exponent, 10 bits for the fraction and 1 bit for the sign. I believe that would be suitable, and it would immediately reduce your requirements to 128 KB/s (1024 Kbps). Bear in mind, though, that a 10-bit fraction gives only about three significant decimal digits, so near 50 metres adjacent representable values are roughly 3 cm apart.
After I originally wrote this, I realized that if you want a uniform representation with a very small error (<= 2 mm), you could instead convert to a 16-bit signed integer in which one unit represents 2 mm of physical height. That gives a uniform resolution of 2 mm, with values ranging from -32768 (== -65536 mm, or approximately -65.5 metres, about -215 ft) to 32767 (== 65534 mm, or approximately 65.5 metres).
That's a very simple alternative representation, based on the simple assumptions that a) the valid range of values is within +/- 65.5 metres, and b) a 2 mm resolution is acceptable for transmission.
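A minimal Python sketch of that 16-bit fixed-point representation follows. The function names, the clamping of out-of-range heights, and the little-endian wire layout are assumptions made for illustration; only the 2 mm unit and the signed 16-bit range come from the text.

```python
import struct

STEP_M = 0.002  # metres per unit (2 mm), as described in the text


def quantize(heights):
    """Heights in metres -> signed 16-bit units of STEP_M, clamped to the valid range."""
    return [max(-32768, min(32767, round(h / STEP_M))) for h in heights]


def dequantize(units):
    return [u * STEP_M for u in units]


def pack_units(units) -> bytes:
    # Little-endian signed 16-bit integers: 2 bytes per height instead of 4.
    return struct.pack(f"<{len(units)}h", *units)


def unpack_units(payload: bytes):
    return list(struct.unpack(f"<{len(payload) // 2}h", payload))


units = quantize([0.0, 1.0, -50.0, 100.0])
print(units)                   # [0, 500, -25000, 32767]; 100 m is out of range and clamped
print(len(pack_units(units)))  # 8
```

For the 256 x 256 heightmap this halves the payload to 128 KB before any general-purpose compressor is applied.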
Second: Modifying (filtering) the data
I don't know if a Discrete Cosine Transform (DCT), similar to what is used in JPEG compression, might be useful as a lossy compression technique. Basically this quantizes the data so that nearly equal neighbouring values are smoothed, such that they can then be better compressed by lossless compression methods.
Third: Traditional Lossless Compression
Otherwise, consider reasonably fast lossless compression techniques such as Lempel-Ziv based methods (LZ, LZH, LZW, etc.) and perhaps the fast LZO method.
Well, a matrix is just a 2D signal. So there are a lot of different compression methods.
I would first try the easy solution: go for inflate/deflate without a container (basically a Zip, without a Zip). http://en.wikipedia.org/wiki/DEFLATE
The compression level will depend on the data, so I can not say, you must try it yourself.
Otherwise, the smarter way to do it would be to send only the changes. If you have access to the server-side code, you can send just the few bytes of the heightmap that have changed each second. That would be the ideal solution, and if you wish, you can even compress the changed bytes with a deflater.
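Both suggestions, container-less DEFLATE and sending only the changes, can be combined in a short Python sketch. It assumes the heights have already been converted to signed 16-bit integers, which the answer does not require, and the function names are made up. `zlib` with a negative `wbits` value produces a raw DEFLATE stream with no zlib or gzip wrapper.

```python
import struct
import zlib


def raw_deflate(data: bytes) -> bytes:
    c = zlib.compressobj(6, zlib.DEFLATED, -15)  # negative wbits: raw DEFLATE, no header
    return c.compress(data) + c.flush()


def raw_inflate(data: bytes) -> bytes:
    return zlib.decompress(data, -15)


def encode_update(prev, cur) -> bytes:
    """Send only the per-cell change since the previous frame (modulo 2**16)."""
    deltas = [(c - p) & 0xFFFF for p, c in zip(prev, cur)]
    return raw_deflate(struct.pack(f"<{len(deltas)}H", *deltas))


def apply_update(prev, payload: bytes):
    raw = raw_inflate(payload)
    deltas = struct.unpack(f"<{len(raw) // 2}H", raw)
    out = []
    for p, d in zip(prev, deltas):
        v = (p + d) & 0xFFFF
        out.append(v - 0x10000 if v >= 0x8000 else v)  # back to signed 16-bit
    return out
```

When most cells are unchanged the delta buffer is mostly zeros, which DEFLATE shrinks to a small fraction of the frame. The receiver must have applied every earlier update, so a full frame would need to be resent after any loss.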
First, I'd figure out whether you can change the basic encoding from 32-bit floating point to some sort of fixed point. Assuming your values all fall within a fairly specific range (which seems likely) this may well be enough to cut your bandwidth in half. Depending on the range (and precision) needed, you might well want to represent an exponential value, so you're capturing a fairly decent idea of a wide range of magnitudes, but small differences are mostly ignored.
I'd guess you don't (probably) expect huge height changes from one sample to the next, and (at a guess) you probably fairly frequently see slopes that continue across a number of samples.
If that's the case, a predictive delta compression will probably work well. The basic idea is that for every (non-edge) point, you predict the value of that point based on surrounding points, and then encode only the difference between what that predicts and the actual value for the point. Depending on how much precision you can lose, you might well be able to encode that delta into a single byte (or maybe even two per byte).
Once you've done that, you can consider using Huffman compression or even arithmetic compression, but either one will slow your compression a fair amount.
First, look at your data. How many bits of information in those floats do you really need to send? Play with chopping off the least significant bits and see whether it's still accurate enough. Next, start with the basic lossless algorithms: compress it with the LZ family (LZ78, LZW, ...) to get a baseline lossless ratio with a fast decompression speed. Then try bzip2 and the like for possibly better compression at the cost of slower decompression. You now have your lossless limit. Now try some lossy algorithms; JPEG and the like have tunable loss ratios and still decompress really fast. Finally, add some filters. Your data would probably compress very well after a simple differential pass along the X or Y axis (or try both and record which one was used in 1 bit). This should make your data even more compressible.
All told, I'd guess you could cut your current bandwidth by at least 3x losslessly and by 10x with a little loss.
If I understand you right, the surface is measured every second. So if the changes within one second are not too high, why not treat the data as video and try a video compression algorithm? Video compression also takes motion compensation into account. Motion compensation is, among the other parts of the algorithm, important for the high compression rates video achieves.
I would try the following:
Because the height change is expected to be quite small every second, try sending the differences in height between 2 consecutive transfers. Multiply these numbers by 10^n so that they can be sent as integers instead of floats. Next, use zero-compressed encoding (search for it), which can reduce the number of bytes to be sent significantly. After that, use some compression algorithm to pack these bytes.
I would think it can be reduced by about 50% (unless the differences are big enough to need 3-4 bytes each).
Well, first off, how many levels of height do you need? What's the maximum difference in height from wave peak to trough? I bet you could represent it with only 256 or 65536 possible height values, which immediately cuts your data to 1/4 or 1/2 without you having to modify your data structure.
You can send the min/max values as floats as well each update, so the 256 levels are always used fully to get the most accuracy possible... as the sea gets rougher you lose accuracy.
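As a rough sketch of that min/max idea in Python: each update carries its own minimum and maximum as two 32-bit floats, followed by one byte per height scaled into 0-255. The function names and the 8-byte header layout are assumptions, not part of the answer.

```python
import struct


def pack_levels(heights) -> bytes:
    """8-byte header (min, max as float32) followed by one byte per height."""
    lo, hi = min(heights), max(heights)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a perfectly flat surface
    levels = bytes(round((h - lo) / span * 255) for h in heights)
    return struct.pack("<ff", lo, hi) + levels


def unpack_levels(payload: bytes):
    lo, hi = struct.unpack_from("<ff", payload, 0)
    span = hi - lo
    return [lo + b / 255 * span for b in payload[8:]]


payload = pack_levels([-1.5, 0.0, 2.5])
print(len(payload))            # 11: 8 header bytes + 3 levels
print(unpack_levels(payload))  # close to the inputs, within half a level
```

The worst-case error is half of `(max - min) / 255`, which is why accuracy drops as the sea gets rougher and the range widens.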
You can also save an image of 256x256 using standard image algorithms. You've not quite got a standard format bitmap, but you could treat it as grayscale: if each vertex V is scaled to a value 0-255, you can build a color (V,V,V) and use an existing JPG library for free. Or you can probably find a DDS file format that has a single channel of 8/16/32-bit data too.
The first part of this I did in the past, successfully. For the second part, I'd be keen to avoid writing your own algorithm and instead get your data into a form that existing libraries, such as D3DX, can use.

What is the best compression algorithm that allows random reads/writes in a file? [closed]

What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithms would be out of the question.
And I know huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?
In the context of your answers, please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle.
I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"
A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
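To make the fixed-size-code idea concrete, here is a minimal sketch. The dictionary contents and the function names are invented for illustration, and every code is assumed to be exactly one byte wide:

```python
# Hypothetical dictionary: entry i is represented by the single byte i.
DICTIONARY = [b"the ", b"quick ", b"brown ", b"fox "]
CODE_OF = {seq: i for i, seq in enumerate(DICTIONARY)}


def encode(sequences):
    """Encode a list of dictionary sequences as one byte per sequence."""
    return bytearray(CODE_OF[s] for s in sequences)


def read_entry(encoded, index):
    """Random read: code number `index` sits at byte offset `index`."""
    return DICTIONARY[encoded[index]]


def write_entry(encoded, index, seq):
    """Random write: overwrite one code in place; neighbors are untouched."""
    encoded[index] = CODE_OF[seq]
```

Note that the index here counts codes, not uncompressed bytes. Mapping an uncompressed byte offset to a code index still needs extra bookkeeping unless every dictionary entry expands to the same length.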
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
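UTF-8 is a familiar example of an encoding whose codes mark their own boundaries: every continuation byte has the bit pattern 10xxxxxx, so a reader dropped at an arbitrary offset can skip forward to the next code start. A small sketch (the function name is made up; UTF-8 is used here only to illustrate the self-synchronizing property, not as a compression scheme):

```python
def next_code_start(data, pos):
    """Return the first offset >= pos that begins a UTF-8 code."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1  # 10xxxxxx is a continuation byte, so keep skipping
    return pos


data = "café au lait".encode("utf-8")
start = next_code_start(data, 4)  # offset 4 is the second byte of "é"
tail = data[start:].decode("utf-8")
```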
I think Stephen Denne might be onto something here. Imagine:
zip-like compression of sequences to codes
a dictionary mapping code -> sequence
file will be like a filesystem
each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
"filesystem" keeps track of which "file" belongs to which bytes (start, end)
each "file" is compressed according to dictionary
reads work filewise, uncompressing and retrieving bytes according to "filesystem"
writes make "files" invalid, new "files" are appended to replace the invalidated ones
this system will need:
defragmentation mechanism of filesystem
compacting dictionary from time to time (removing unused codes)
done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regroup them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
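A minimal sketch of the read-only case, assuming zlib as the per-chunk compressor and keeping everything in memory (the function names and the index layout are my own choices, not a standard format):

```python
import zlib

CHUNK = 8 * 1024  # uncompressed chunk size, as in the 8K example


def compress_chunks(data):
    """Return (blob, index); index[i] = (offset, length) of chunk i in blob."""
    blob, index = bytearray(), []
    for start in range(0, len(data), CHUNK):
        packed = zlib.compress(data[start:start + CHUNK])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob, index, offset, n):
    """Read n uncompressed bytes starting at uncompressed offset `offset`."""
    first = offset // CHUNK
    last = min((offset + n - 1) // CHUNK, len(index) - 1)
    out = bytearray()
    for i in range(first, last + 1):  # more than one chunk if the read spans a boundary
        pos, size = index[i]
        out += zlib.decompress(blob[pos:pos + size])
    skip = offset - first * CHUNK
    return bytes(out[skip:skip + n])
```

A read that crosses a chunk boundary simply decompresses both neighbors and slices the joined result. For a 100GB file the blob would live on disk and each chunk would be fetched with a seek, but the index arithmetic stays the same.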
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
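Here is one way the write path might look, as a sketch under simplifying assumptions: zlib per chunk, everything in memory, overwrites only (the file never grows), and a hypothetical `slack` parameter for the padding reserved after each chunk. The class and method names are invented for illustration.

```python
import zlib

CHUNK = 8 * 1024  # uncompressed chunk size


class ChunkStore:
    def __init__(self, data, slack=64):
        self.blob = bytearray()
        self.slack = slack  # spare bytes reserved after each compressed chunk
        self.index = []     # per chunk: [offset, used, capacity]
        for start in range(0, len(data), CHUNK):
            self.index.append(self._place(zlib.compress(data[start:start + CHUNK])))

    def _place(self, packed):
        """Append a compressed chunk plus padding; return its index entry."""
        entry = [len(self.blob), len(packed), len(packed) + self.slack]
        self.blob += packed + bytes(self.slack)
        return entry

    def read_chunk(self, i):
        off, used, _ = self.index[i]
        return zlib.decompress(bytes(self.blob[off:off + used]))

    def _store(self, i, packed):
        off, _, cap = self.index[i]
        if len(packed) <= cap:  # still fits in its padded slot
            self.blob[off:off + len(packed)] = packed
            self.index[i][1] = len(packed)
        else:                   # too big: move it, leaving a hole behind
            self.index[i] = self._place(packed)

    def write(self, offset, payload):
        """Overwrite bytes at an uncompressed offset, one chunk at a time."""
        while payload:
            i, within = divmod(offset, CHUNK)
            plain = bytearray(self.read_chunk(i))
            take = min(len(payload), len(plain) - within)
            if take <= 0:
                raise ValueError("write past end of data")
            plain[within:within + take] = payload[:take]
            self._store(i, zlib.compress(bytes(plain)))
            offset, payload = offset + take, payload[take:]
```

Relocated chunks leave dead space in the blob, so a real implementation would also need some form of compaction.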
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.
Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
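A toy sketch of the bookmark idea, assuming a fixed prefix code that never changes with position. The code table below is hypothetical, and bits are held in a Python string purely for readability:

```python
# Hypothetical fixed prefix code; the same table applies at every position.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}


def encode_with_bookmarks(text, every=4):
    """Return (bits, bookmarks); bookmarks[k] is the bit offset of symbol k*every."""
    pieces, bookmarks, pos = [], [], 0
    for i, sym in enumerate(text):
        if i % every == 0:
            bookmarks.append(pos)
        pieces.append(CODE[sym])
        pos += len(CODE[sym])
    return "".join(pieces), bookmarks


def read_symbols(bits, bookmarks, start, count, every=4):
    """Decode `count` symbols beginning at symbol number `start`."""
    k = start // every
    pos = bookmarks[k]        # jump to the nearest bookmark at or before start
    skip = start - k * every  # symbols to decode and throw away
    out, current = [], ""
    while pos < len(bits) and len(out) < skip + count:
        current += bits[pos]
        pos += 1
        if current in DECODE:  # a complete code has been read
            out.append(DECODE[current])
            current = ""
    return "".join(out[skip:])
```

The `every` parameter is the trade-off described above: a smaller value means more bookmark storage but fewer symbols decoded and discarded per read.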
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
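A small sketch of that overlay approach, using a plain dict as the "sparse matrix" and zlib for the untouched original. The class name is made up, and decompressing the whole stream on a miss is a simplification; in practice you would combine this with chunking or bookmarks.

```python
import zlib


class OverlayFile:
    def __init__(self, original):
        self.packed = zlib.compress(original)  # never rewritten after this
        self.overlay = {}  # uncompressed offset -> replacement byte value

    def write(self, offset, payload):
        """Record the new bytes in the overlay; the compressed data is untouched."""
        for i, b in enumerate(payload):
            self.overlay[offset + i] = b

    def read(self, offset, n):
        wanted = range(offset, offset + n)
        if all(i in self.overlay for i in wanted):
            # Fully covered by later writes: no decompression needed.
            return bytes(self.overlay[i] for i in wanted)
        base = zlib.decompress(self.packed)[offset:offset + n]
        return bytes(self.overlay.get(offset + i, b) for i, b in enumerate(base))
```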