What compression/archive formats support inter-file compression?

This question on archiving PDFs got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.

Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
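To make the difference concrete, here is a minimal sketch comparing per-file compression with "solid" compression of the concatenated files. It uses Python's zlib as a stand-in for gzip; the template and the "invoice" files are made-up test data, not anything from the question.

```python
import random
import string
import zlib

# Hypothetical data: one shared 4000-character "letterhead" template
# plus a small per-file change appended to it.
rng = random.Random(0)
template = "".join(rng.choices(string.ascii_lowercase + " ", k=4000)).encode()
files = [template + b"Invoice number %d" % i for i in range(20)]

# Per-file compression: every file pays for the template again.
per_file_total = sum(len(zlib.compress(f, 9)) for f in files)

# "Solid" compression: concatenate first, then compress once, so each
# copy of the template can refer back to the previous one.
solid_total = len(zlib.compress(b"".join(files), 9))

print("per-file:", per_file_total, "bytes; solid:", solid_total, "bytes")
```

The gain only shows up because each file is much smaller than the 32 KB DEFLATE window, so the previous copy of the template is still visible when the next one is encoded.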

Take a look at Google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
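The window-size point can be checked with a small experiment. This is only a sketch under artificial assumptions: the "template" is 100 KB of pseudo-random bytes (so nothing inside it compresses), and LZMA with its default multi-megabyte dictionary stands in for a long-range matcher. It is not the Bentley/McIlroy algorithm itself.

```python
import lzma
import random
import zlib

rng = random.Random(0)
# Assumed test data: an incompressible 100 KB template, larger than
# the 32 KB window used by DEFLATE (gzip/zlib).
template = bytes(rng.getrandbits(8) for _ in range(100_000))
doc_a = template + b"change A"
doc_b = template + b"change B"
both = doc_a + doc_b

# DEFLATE cannot look back 100 KB, so the second copy costs full price.
deflate_size = len(zlib.compress(both, 9))

# LZMA's dictionary covers the whole input, so the second copy is
# encoded as back-references to the first.
lzma_size = len(lzma.compress(both))

print("input:", len(both), "deflate:", deflate_size, "lzma:", lzma_size)
```

With real documents the template would itself compress somewhat, but the qualitative effect is the same: once the shared part no longer fits in the window, plain concatenation plus gzip stops helping.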
They are used by Google internally inside their Bigtable implementation to store compressed web pages, for much the same reason you are seeking them.

Since Lempel-Ziv style compression (which pretty much all of them use) involves building up a dictionary of repeated strings as you go along, a scheme such as the one you describe would limit you to having to decompress the entire archive at once.
If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.

Related

Streaming mode vs block mode

I can't figure out what exactly is the streaming mode offered by modern compression/decompression algorithms (eg ZStandard or LZ4) and how I can exploit it.
As an example, suppose I have four 16 KB files. I can compress each file individually and obtain four compressed files of different lengths. However, I could compress all four files together (sending them sequentially, right?) using streaming mode, obtain a single compressed output, and expect the compression ratio to be better.
Can I decompress (say) only the 3rd file without decompressing all the previous files? Does streaming mode introduce a dependency between the files I appended?
Yes, streaming introduces a dependency between files.
In your example, decoding file3 would require to decode first file1 then file2.
Note also that data will appear as appended, with no specific marker between files. So one would need a way to know where each file starts and ends, if that is important. Sometimes it's implicit (e.g. a fixed 16 KB size), sometimes it can be deduced from the data itself (a specific end-of-file marker), sometimes it needs additional metadata. It all depends on the application.
You are correct that the compression ratio of C(4xFiles) is expected to be better than 4xC(File), especially if the 4 files are somewhat related (for example if they all are text files).
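A minimal sketch of that dependency, using Python's zlib streaming API as a stand-in for Zstandard or LZ4 (whose streaming modes behave the same way in this respect). The four files and the `sizes` list are made-up; `sizes` plays the role of the "additional metadata" that records where each file ends.

```python
import zlib

# Four hypothetical ~16 KB files with related content.
files = [(b"file %d: some related text. " % i) * 600 for i in range(4)]
sizes = [len(f) for f in files]  # metadata kept alongside the stream

# Streaming compression: one compressor object sees all files in turn.
comp = zlib.compressobj()
stream = b"".join(comp.compress(f) for f in files) + comp.flush()

def read_file(stream, sizes, index):
    """Return file `index`; everything before it has to be decoded too."""
    start = sum(sizes[:index])
    end = start + sizes[index]
    d = zlib.decompressobj()
    # max_length stops decoding once `end` bytes have been produced,
    # so files after the requested one are not decoded.
    data = d.decompress(stream, end)
    return data[start:end]

third = read_file(stream, sizes, 2)
```

Reading the third file still decodes the first two; the only saving is that decoding can stop early instead of running to the end of the stream.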

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now I want to develop a game, where random access to individual files will be more efficient. If everything is archived into one stream first, access becomes sequential.
I know that there is an alternative format called zip or 7z, but what about other formats?
Apart from tar.gz, I'd like a small compression library that also provides archiving features.
Should I use *.tar or other solutions are available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of contents of the .tar file, find and store where all the files are in the archive, and then extract them as you need. However, you may find that using your own format is "better". For example, if I remember correctly, a tar archive stores one header per file, immediately before that file's data, whereas you may want to build your header in one lump before you store the files. That does mean at least enumerating all the relevant files first, then forming the compressed variants and "patching up" the header with the offsets of the compressed data.
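As an illustration of the "header in one lump" idea, here is a sketch of a made-up container format. The question is about C++, but the layout is language-independent; Python is used only to keep the example short. The layout (4-byte length, JSON index, then the compressed blobs) and the names `pack` and `read_one` are all hypothetical. Offsets are stored relative to the end of the header, which avoids having to patch the header afterwards.

```python
import json
import struct
import zlib

def pack(files):
    """files: dict of name -> bytes.
    Layout: [4-byte index length][JSON index][compressed blobs]."""
    blobs, index, offset = [], {}, 0
    for name, data in files.items():
        blob = zlib.compress(data)          # each file compressed on its own
        index[name] = [offset, len(blob)]   # offset relative to blob area
        blobs.append(blob)
        offset += len(blob)
    header = json.dumps(index).encode()
    return struct.pack("<I", len(header)) + header + b"".join(blobs)

def read_one(archive, name):
    """Decompress a single member without touching the others."""
    (header_len,) = struct.unpack_from("<I", archive, 0)
    index = json.loads(archive[4:4 + header_len])
    offset, size = index[name]
    start = 4 + header_len + offset
    return zlib.decompress(archive[start:start + size])

archive = pack({"level1.map": b"wall " * 500, "hero.txt": b"sprite " * 300})
hero = read_one(archive, "hero.txt")
```

On disk the same idea works with a seek to `start` and a read of `size` bytes, so only the requested member is ever read and decompressed.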
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, both file formats use the same compression algorithm (by default), and the only reason that .tar.gz files are typically smaller is because they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, id Tech engines since Quake III (.pk3, .pk4) have relied on plain zip archives under a different extension; the older Quake .pak format is a simple uncompressed container.
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive, you have to decompress the entire thing (or at least everything that precedes it). This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip, and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives, just start with libzip (not to be confused with zlib (for gzip or .gz)).
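For a quick feel of what per-file access to a zip looks like, here is a small sketch using Python's standard zipfile module rather than libzip (the C/C++ API differs, but the access pattern is the same). The member names and contents are invented for the example.

```python
import io
import zipfile

# Build a small zip in memory; every member is deflated on its own.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("textures/wall.txt", b"brick " * 1000)
    zf.writestr("sounds/step.txt", b"thud " * 1000)

# Random access: the central directory gives the member's offset,
# so only that one member is read and decompressed.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    step = zf.read("sounds/step.txt")
```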
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-ballz" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the "gzip" compression (from zlib) has just been historically an industry-standard compression algorithm, although, now, there are better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Mostly, compressed archive formats often reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, BZIP, etc.), and often, formats like zip / 7z / rar are only really container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z. And their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.

Can compression algorithm "learn" on set of files and compress them better?

Is there compression library that support "learning" on some set of files or using some files as base for compressing other files?
This can be useful if we want to compress many similar files retaining fast access to each of them.
Something like:
# compression:
compressor.learn_on_data(standard_data);
compressor.compress(data, data_compressed);
# decompression:
decompressor.learn_on_data(the_same_standard_data);
decompressor.decompress(data_compressed, data);
What is this called (I think that "delta compression" is a slightly different thing)? Are there implementations of this in popular compression libraries? I expect it to work by, for example, pre-filling dictionaries with standard data.
Yes it works.
Although there are many techniques for this, the easiest one you'll find is called "dictionary pre-filling". In short, you provide a file, of which the last part is "digested" (up to the maximum window size, which can be anywhere from 4 KB to 64 MB depending on your algorithm), and which can therefore be used to better compress the next file.
In practice, this is similar to "solid mode", when within an archive all files of identical type are grouped together, so that the previous file can be used as a dictionary for the next one, which improves compression ratio.
Downside : the same dictionary must be provided for both the compressor and decompressor.
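One readily available implementation of dictionary pre-filling is the preset-dictionary feature of zlib, exposed in Python through the `zdict` argument. The sketch below mirrors the learn/compress/decompress pseudocode from the question; the two JSON-like records are made-up sample data, and the helper names are just for illustration.

```python
import zlib

# Hypothetical "standard data" and a similar record to compress.
standard_data = (b'{"company": "ACME Corp", "address": "1 Example Street", '
                 b'"status": "active", "items": []}')
data = (b'{"company": "ACME Corp", "address": "1 Example Street", '
        b'"status": "closed", "items": [7]}')

def compress_with_dict(payload, dictionary):
    # The dictionary pre-fills the compressor's window before any input.
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(payload) + c.flush()

def decompress_with_dict(blob, dictionary):
    # The decompressor must be given exactly the same dictionary.
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

plain = zlib.compress(data, 9)
primed = compress_with_dict(data, standard_data)
restored = decompress_with_dict(primed, standard_data)

print("without dictionary:", len(plain), "with dictionary:", len(primed))
```

Each record stays independently decompressible, which is the "fast access to each of them" property asked for; the price is that the dictionary has to be shipped or stored separately on both sides.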

DICOM File compression

My line of work requires the use of DICOM files. Each DICOM study consists of many .dcm files in a single directory. I am required to send these files over the network, a process which is somewhat slow due to the massive size of the files.
I am also a programmer and I was wondering what is the ideal way to compress such files? I'm talking about a compression that will be made on the local computer and later decompressed on the destination computer (namely the compression is solely for speeding up the over-the-network transfer of the file). Is there a simple way to crop the DICOM files? (the files contain imaging of an entire head, whereas I'm only interested in a small part of the head).
Thanks!
In a medical context, lossy compression is somewhere between not encouraged and forbidden. If you insist on cropping existing datasets, the standard requires you to generate at least new image and series UIDs. The standard does allow lossless compression in the form of JPEG 2000, but it is quite rare - if I had to bet, I'd say your dataset is uncompressed altogether.
In my experience it is significantly better to compress a medical dataset as a solid archive - that is, unify all the images into a single stream. This makes a lot of sense, as there is typically a lot of similarity between nearby images and this is the way to take advantage of that similarity (a unified compression dictionary). This is available as an option in both RAR and 7-Zip, and it is what tar followed by gzip does implicitly.
Solution:
gdcmconv --jpeg uncompressed.dcm compressed.dcm
or for better compression ratio:
gdcmconv --jpegls uncompressed.dcm compressed.dcm
See:
http://gdcm.sourceforge.net/html/gdcmconv.html
I would also recommend against lossy compression, you would need to be a DICOM wizard to do it properly (see derivation mechanism in the DICOM standard). I would also recommend against cropping the image (you would need to regenerate UIDs, get the Frame or Reference updated...)
HTH
You could use something simple like LZMA compression on one end to pack up the files and send them over. This is the easiest solution, since you can grab something like gzip or xz and pack/unpack the files easily programmatically. This may help considerably, because transferring one large file usually has far less overhead than transferring many small files (a single 1 GB file will typically transfer much faster than 10000 100 KB files).
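A minimal sketch of the "pack everything into one compressed file" approach, using Python's standard tarfile module with LZMA (xz) compression. The function names are made up for the example, and nothing here is DICOM-specific; it simply bundles whatever files are in the directory.

```python
import tarfile
from pathlib import Path

def pack_directory(src_dir, archive_path):
    """Bundle every file under src_dir into one LZMA-compressed tar."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:xz") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store names relative to the study directory.
                tar.add(str(path), arcname=str(path.relative_to(src)))

def list_archive(archive_path):
    """Return the member names, e.g. to verify a transfer."""
    with tarfile.open(archive_path, "r:xz") as tar:
        return tar.getnames()
```

Because the tar stream is compressed as a whole, neighbouring slices can share redundancy, and the receiver gets a single file to unpack.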
As for actually reducing the aggregate size, each .dcm file is probably a slice (if you're looking at something like MRI or CT data), and the viewer you are using reconstructs the slices into the 3d image. Cropping them isn't impossible, but parsing the DICOM format is a bit tricky. I'm not aware of any free programs that will help you parse the DICOM files, but I haven't looked for some time.
Since DICOM is a container format, the image data you are after is usually stored in a common format (such as JPEG), so if you are able to grab the relevant part of the file to extract the image data, you can use any of the loads of image processing tools available to crop the image to whatever dimensions you choose.
We have a compression router called "DICOM Shrinkinator" that can do this as it transmits the study to PACS:
http://fluxinc.ca/medical/dicom-shrinkinator/

Which files do not reduce in size after compression?

I have written a Java program for compression. I compressed some text files, and the file size was reduced after compression. But when I tried to compress a PDF file, I did not see any change in file size after compression.
So I want to know which other files will not reduce in size after compression.
Thanks
Sunil Kumar Sahoo
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
jpeg/gif/avi/mpeg/mp3 and other already-compressed files won't change much after compression. You may see a small decrease in file size.
Compressed files will not reduce their size after compression.
Five years later, I have at least some real statistics to show of this.
I've generated 17439 multi-page PDF files with PrinceXML that total 4858 MB. A zip -r archive pdf_folder gives me an archive.zip that is 4542 MB. That's 93.5% of the original size, so not worth it to save space.
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
PDF files are already compressed. They use the following compression algorithms:
LZW (Lempel-Ziv-Welch)
FLATE (ZIP, in PDF 1.2)
JPEG and JPEG2000 (PDF version 1.5)
CCITT (the facsimile standard, Group 3 or 4)
JBIG2 compression (PDF version 1.4)
RLE (Run Length Encoding)
Depending on which tool created the PDF and which version it targets, different types of compression are used. You can compress it further by using a more efficient algorithm, or lose some quality by converting images to low-quality JPEGs.
There is a great link on this here
http://www.verypdf.com/pdfinfoeditor/compression.htm
Files encrypted with a good algorithm like IDEA or DES in CBC mode don't compress anymore regardless of their original content. That's why encryption programs first compress and only then run the encryption.
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
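This is easy to try out. The sketch below compresses some made-up, fairly repetitive text once, then feeds the compressed output through the same compressor again; zlib stands in for whatever compressor the question's Java program uses.

```python
import random
import zlib

# Made-up test input: 20000 words drawn from a tiny vocabulary.
rng = random.Random(0)
words = [b"compress", b"archive", b"file", b"window",
         b"dictionary", b"stream", b"block", b"ratio"]
text = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(text, 9)    # first pass removes the redundancy
twice = zlib.compress(once, 9)   # second pass has almost nothing left to remove

print("original:", len(text), "once:", len(once), "twice:", len(twice))
```

The first pass shrinks the text a lot; the second pass has nothing to work with and only adds its own framing around data that already looks random.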
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
Media files don't tend to compress well. JPEG and MPEG don't compress while you may be able to compress .png files
Files that are already compressed usually can't be compressed any further. For example mp3, jpg, flac, and so on.
You could even get files that are bigger because of the re-compressed file header.
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and such that are already compressed, so it will not compress much further. Your algorithm is probably only able to eke out meagre if any savings based on the text strings contained in the PDF?
Simple answer: compressed files (otherwise we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression, and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
You can add all Office 2007 file formats to the list (of @waqasahmed):
Since the Office 2007 .docx and .xlsx (etc) are actually zipped .xml files, you also might not see a lot of size reduction in them either.
Truly random
Approximation thereof, made by cryptographically strong hash function or cipher, e.g.:
AES-CBC(any input)
"".join(map(b2a_hex, [md5(str(i)) for i in range(...)]))
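A runnable variant of the hash-based approximation, as a sketch. The count of 10000 hashes is an arbitrary choice for the example (the original leaves the range open). It also shows a detail worth knowing: the raw digests do not compress, but their hex encoding does, because hex text spends 8 bits on every 4 bits of information.

```python
import hashlib
import zlib

# Raw MD5 digests of a counter: a stand-in for "random" bytes.
raw = b"".join(hashlib.md5(str(i).encode()).digest() for i in range(10000))
# The same data hex-encoded, as in the one-liner above.
hexed = raw.hex().encode()

raw_packed = zlib.compress(raw, 9)
hex_packed = zlib.compress(hexed, 9)

print("raw:", len(raw), "->", len(raw_packed))
print("hex:", len(hexed), "->", len(hex_packed))
```

So to get a file that genuinely refuses to shrink, use the raw digest bytes (or cipher output) rather than a textual encoding of them.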
Any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression must be lossless and reversible), which possibility the pigeonhole principle excludes.
So there are infinitely many files which do NOT shrink after compression and, moreover, a file does not need to be a high-entropy file to be one of them :)
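The counting behind the pigeonhole argument can be spelled out for bit strings. This is just an illustrative sketch; the function names are made up.

```python
def count_exact(length):
    """Number of distinct bit strings of exactly `length` bits."""
    return 2 ** length

def count_shorter(length):
    """Number of distinct bit strings with fewer than `length` bits
    (including the empty string)."""
    return sum(2 ** k for k in range(length))

for n in (1, 8, 16):
    # There is always exactly one fewer shorter string than strings
    # of length n, so not every n-bit input can map to a shorter output.
    print(n, count_exact(n), count_shorter(n))
```

Since a lossless compressor must map distinct inputs to distinct outputs, at least one input of every length has to stay the same size or grow.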