I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++ but when I want to get data out of the file I need to extract the whole thing. Does anyone know a compression component (preferably .NET compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original File Compressed File
10 - 27 => 2-5
100-202 => 10-19
..............
10230-102020 => 217-298
Since I know the data I need in the file only occurs in the original file between position 10-27, i'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?
I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had similar problem with project I am working on (at least I think so), where I had to keep many text articles on drive and access them in quite random manner, and because of size of data I had to compress them.
Problem with compressing all this data at once is that, most algorithms depends on previous data when decompressing it. For example, popular LZW method creates adictionary (an instruction on how to decompress data) on run, while doing the decompression, so decompressing stream from the middle is not possible, although I believe those methods might be tuned for it.
Solution I have found to be working best, although it does decrease your compression ratio is to pack data in chunks. In my project it was simple - each article was 1 chunk, and I compressed them 1 by 1, then created an index file that kept where each "chunk" starts, decompressing was easy in that case - just decompress whole stream, which was one article that I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such elegant manner, you can always try to split chunks artificially, for example, pack data in 5MB chunks. So when you will need to read data from 7MB to 13MB, you will just decompress chunks 5-10 and 10-15.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives slightly worse compression ratio. The smaller the chunks are - the worse the compression will be.
Also: Having many chunks of data don't mean you have to have different files in hard drive, just pack them after each other in 1 file and remember when they start.
Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files that you can use to compress and is written in c#. Worked pretty nice for me and you can use its built functionality of creating many files in 1 zip file to take care of splitting data into chunks.
Related
I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.
Can you keep sending the output of BZip2 (or any compression software) back through the compression process over and over again to make the output files smaller and smaller? Can you compress a file using one software (BZip2) that was already compressed using another method (Snappy)?
No and no. (For lossless compression.)
If the original file was extremely redundant, like megabytes of nothing but zeros, then the first, and maybe the second recompression will result in compression. But at some point there will be no gain from recompression, and instead a small increase in file size. For normal files, the first recompression will result in no gain.
This is true regardless of how you might mix lossless compressors.
I can't figure out what exactly is the streaming mode offered by modern compression/decompression algorithms (eg ZStandard or LZ4) and how I can exploit it.
As an example, suppose I have 4x16KB file. I can (individually) compress each file and obtain 4xDifferentCompressedLength files. However I could compress all 4 files together (sending them sequentially, right?) using streaming mode and obtain 1xCompressedLength and expect the compression ratio to be better.
Can I decompress (say) only the 3rd file without decompressing all the previous files? Do streaming mode introduce dependency between the files I appended?
Yes, streaming introduce dependency between files.
In your example, decoding file3 would require to decode first file1 then file2.
Note also that data will appear as appended, with no specific marker between files. So one would need a way to know where each file starts and ends if it's important. Sometimes it's implicit (ex : fixed 16KB size), sometimes it can be deducted from data itself (specific end-of-mark), sometimes it needs additional metadata. It all depends on the application.
You are correct that the compression ratio of C(4xFiles) is expected to be better than 4xC(File), especially if the 4 files are somewhat related (for example if they all are text files).
I was just wondering what could be the impact if, say, Microsoft decided to automaticly "lossless" compress every single file saved in a computer.
What are the pros? The cons? Is it feasible?
Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (An example: huffman coding). To access the data you have to uncompress it, and this translates to time and used memory, as to access a specific piece of the file you have to decompress it as a whole. While decompressing you ave to save the results somewhere and the most appropriate place is RAM.
Of course this wouldn't be a great problem (decompressing the whole file) if all of it needed to be read, and not even in the case of a stream reading it, but if a program wanted to write to the compressed file all the data would have to be compressed again, or at least a part of it.
As you can see, compressing files in the filesystem would reduce a lot the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and also require more RAM.
I want to compress small text (400 bytes) and decompress it on the other side. If I do it with standard compressor like rar or zip, it writes metadata along with the compressed file and it's bigger that the file itself..
Is there a way to compress the file without this metadata and open it on the other side with known ahead parameters?
You can do raw deflate compression with zlib. That avoids even the six-byte header and trailer of the zlib format.
However you will find that you still won't get much compression, if any at all, with just 400 bytes of input. Compression algorithms need much more history than that to get rolling, in order to build statistics and find redundancy in the data.
You should consider either a dictionary approach, where you build a dictionary of representative strings to provide the compressor something to work with, or you can consider a sequence of these 400-byte strings to be a single stream that is decompressed as a stream on the other end.
You can have a look at compression using Huffman codes. As an example look at here and here.