Recompressing Compressed Files - compression

Can you keep sending the output of BZip2 (or any compression software) back through the compression process over and over again to make the output files smaller and smaller? Can you compress a file using one software (BZip2) that was already compressed using another method (Snappy)?

No and no. (For lossless compression.)
If the original file was extremely redundant, like megabytes of nothing but zeros, then the first, and maybe the second, recompression will yield further gains. But at some point there will be no gain from recompression, only a small increase in file size. For a normal file, even the first recompression will yield no gain.
This is true regardless of how you might mix lossless compressors.
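A minimal way to see this for yourself (a sketch using Python's standard bz2 module; exact byte counts vary by version):

import bz2

data = bytes(1_000_000)              # a megabyte of zeros: extremely redundant
for i in range(4):
    data = bz2.compress(data)
    print("pass", i + 1, ":", len(data), "bytes")

The first pass shrinks the zeros dramatically; the following passes gain little or nothing and soon start growing the output, because compressed data looks essentially random to the compressor.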

Related

Is there a compression format that allows decompression at any point in the file?

I see that the issue of randomly-reading compressed data is typically resolved by block compression, allowing decompression to start at the nearest compressed block start position which, depending on the block size, should be close to where the user actually wanted to start decompression from. However, I am curious if there exists any compression algorithm that allows for decompression to truly start from any position in the compressed stream.
Certainly no standard compression format. I could imagine a simple, fixed Huffman coding of symbols that would let you enter the stream not just anywhere, but at the start of any Huffman code. However, without an index as big as the file itself, there would be no way to know which bit positions are the starts of codes. In any case, the compression would be unimpressive using Huffman coding alone.
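A toy sketch of that idea (Python, not any real format; the code table is made up): with a fixed prefix code you can resume decoding at any bit offset that happens to be the start of a code word, but nothing in the stream tells you which offsets those are.

CODE = {"a": "0", "b": "10", "c": "11"}     # fixed Huffman-style prefix code
DECODE = {v: k for k, v in CODE.items()}

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode_from(bits, start):
    # assumes `start` is the beginning of a code word
    out, cur = [], ""
    for bit in bits[start:]:
        cur += bit
        if cur in DECODE:
            out.append(DECODE[cur])
            cur = ""
    return "".join(out)

bits = encode("abacab")        # '010011010'
print(decode_from(bits, 0))    # 'abacab' - the full stream
print(decode_from(bits, 3))    # 'acab'   - offset 3 happens to be a code boundary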

Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically apply "lossless" compression to every single file saved on a computer.
What are the pros? The cons? Is it feasible?
Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (for example, Huffman coding). To access the data you have to decompress it, and this translates into time and memory used: to get at a specific piece of the file you have to decompress it as a whole, and while decompressing you have to put the result somewhere, most conveniently in RAM.
Of course this wouldn't be a big problem (decompressing the whole file) if all of it needed to be read anyway, or even if it were only being read as a stream, but if a program wanted to write to the compressed file, all of the data - or at least a part of it - would have to be compressed again.
As you can see, compressing files in the filesystem would greatly reduce the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and it would also require more RAM.
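A small sketch of the write problem using Python's gzip module (the file name is made up): changing even one byte means decompressing and recompressing the whole thing.

import gzip

with gzip.open("document.txt.gz", "rb") as f:   # read: decompress to get at the bytes
    data = f.read()

patched = data[:1000] + b"X" + data[1001:]      # change a single byte

with gzip.open("document.txt.gz", "wb") as f:   # write: recompress everything
    f.write(patched)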

Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++, but when I want to get data out of the file I need to extract the whole thing. Does anyone know of a compression component (preferably .NET compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original file        Compressed file
10-27            =>  2-5
100-202          =>  10-19
...
10230-102020     =>  217-298
Since I know the data I need only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?
I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem (at least I think so) in a project I am working on, where I had to keep many text articles on disk and access them in a fairly random manner, and because of the size of the data I had to compress them.
The problem with compressing all of this data at once is that most algorithms depend on the previously decompressed data when decompressing. For example, the popular LZW method builds its dictionary (the instructions on how to decompress the data) on the fly, while doing the decompression, so decompressing a stream from the middle is not possible, although I believe these methods could be tuned for it.
The solution I found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project this was simple - each article was one chunk, I compressed them one by one and then created an index file that records where each "chunk" starts. Decompressing was easy in that case - just decompress the whole chunk, which was the one article I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such an elegant manner, you can always split the chunks artificially, for example packing the data in 5 MB chunks. Then, when you need to read the data from 7 MB to 13 MB, you just decompress the 5-10 MB and 10-15 MB chunks.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives a slightly worse compression ratio: the smaller the chunks, the worse the compression.
Also: having many chunks of data doesn't mean you need many separate files on the hard drive - just pack them one after another in a single file and remember where each one starts.
Also: http://dotnetzip.codeplex.com/ is a nice library, written in C#, for creating zip files that you can use for the compression. It worked pretty nicely for me, and you can use its built-in support for storing many files in one zip archive to take care of splitting the data into chunks.
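For what it's worth, here is a rough sketch of that chunk-plus-index idea in Python with zlib (chunk size, function names and index layout are all arbitrary choices, and a real implementation would also persist the index somewhere):

import zlib

CHUNK = 5 * 1024 * 1024                      # 5 MB chunks, as in the example above

def compress_with_index(src_path, dst_path):
    # returns a list of (original_offset, compressed_offset, compressed_length)
    index, orig_off, comp_off = [], 0, 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            comp = zlib.compress(chunk)
            index.append((orig_off, comp_off, len(comp)))
            dst.write(comp)
            orig_off += len(chunk)
            comp_off += len(comp)
    return index

def read_range(dst_path, index, start, end):
    # returns original bytes [start, end) by decompressing only the chunks that cover it
    out = bytearray()
    with open(dst_path, "rb") as f:
        for orig_off, comp_off, comp_len in index:
            if orig_off + CHUNK <= start or orig_off >= end:
                continue                     # chunk lies entirely outside the range
            f.seek(comp_off)
            chunk = zlib.decompress(f.read(comp_len))
            out += chunk[max(start - orig_off, 0):min(end - orig_off, len(chunk))]
    return bytes(out)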

DICOM File compression

My line of work requires the use of DICOM files. Each DICOM dataset consists of many .dcm files in a single directory. I am required to send these files over the network, a process which is somewhat slow due to the massive size of the files.
I am also a programmer, and I was wondering what the ideal way to compress such files is. I'm talking about compression performed on the local computer and decompressed later on the destination computer (that is, the compression is solely for speeding up the over-the-network transfer of the files). Is there a simple way to crop the DICOM files? (The files contain imaging of an entire head, whereas I'm only interested in a small part of it.)
Thanks!
In a medical context, lossy compression is somewhere between discouraged and forbidden. If you insist on cropping existing datasets, the standard demands that you at least generate new image & series UIDs. The standard does allow lossless compression in the form of JPEG 2000, but it is quite rare - if I had to bet, I'd say your dataset is not compressed at all.
In my experience it is significantly better to compress a medical dataset as a solid archive - that is, to unify all the images into a single stream. This makes a lot of sense, as there is typically a lot of similarity between nearby images, and a unified compression dictionary is the way to take advantage of that similarity. This is available as a command-line option in both the rar and gzip compressors.
Solution:
gdcmconv --jpeg uncompressed.dcm compressed.dcm
or for better compression ratio:
gdcmconv --jpegls uncompressed.dcm compressed.dcm
See:
http://gdcm.sourceforge.net/html/gdcmconv.html
I would also recommend against lossy compression; you would need to be a DICOM wizard to do it properly (see the derivation mechanism in the DICOM standard). I would also recommend against cropping the image (you would need to regenerate UIDs, get the Frame of Reference updated, ...).
HTH
You could use something simple like LZMA compression on one end to pack up the files and send them over. This is the easiest solution, since you can grab something like gzip and pack/unpack the files easily programmatically. This may help considerably, because one large file transfers over the network much faster than many small files (a single 1 GB file will transfer much faster than 10,000 100 KB files).
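A minimal sketch of that approach with Python's standard library (tar + LZMA; the directory and file names are made up):

import tarfile
from pathlib import Path

def pack(study_dir, archive_path):
    with tarfile.open(archive_path, "w:xz") as tar:          # xz = LZMA
        for dcm in sorted(Path(study_dir).glob("*.dcm")):
            tar.add(dcm, arcname=dcm.name)

def unpack(archive_path, out_dir):
    with tarfile.open(archive_path, "r:xz") as tar:
        tar.extractall(out_dir)

pack("study_001/", "study_001.tar.xz")       # on the sending side
unpack("study_001.tar.xz", "received/")      # on the receiving side

Packing everything into one archive also gets you the "one large file instead of many small ones" benefit mentioned above.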
As for actually reducing the aggregate size, each .dcm file is probably a slice (if you're looking at something like MRI or CT data), and the viewer you are using reconstructs the slices into the 3d image. Cropping them isn't impossible, but parsing the DICOM format is a bit tricky. I'm not aware of any free programs that will help you parse the DICOM files, but I haven't looked for some time.
Since DICOM is a container format, the image data you are after is usually stored in a common format (such as JPEG), so if you are able to grab the relevant part of the file to extract the image data, you can use any of the loads of image processing tools available to crop the image to whatever dimensions you choose.
We have a compression router called "DICOM Shrinkinator" that can do this as it transmits the study to PACS:
http://fluxinc.ca/medical/dicom-shrinkinator/

Which files do not reduce in size after compression [closed]

I have written a Java program for compression. I have compressed some text files, and the file size was reduced after compression. But when I tried to compress a PDF file, I did not see any change in the file size after compression.
So I want to know which other kinds of files will not reduce in size after compression.
Thanks
Sunil Kumar Sahoo
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
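A rough way to see the difference (Python's zlib; the exact numbers depend on the library version):

import os, zlib

redundant  = b"the quick brown fox " * 50_000
already    = zlib.compress(redundant)        # stand-in for a JPEG stream inside a PDF
random_ish = os.urandom(len(already))

for name, blob in [("redundant", redundant), ("compressed", already), ("random", random_ish)]:
    print(name, len(blob), "->", len(zlib.compress(blob)))

The redundant input shrinks enormously; the already-compressed and random inputs stay about the same size or even grow slightly.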
jpeg/gif/avi/mpeg/mp3 and other already-compressed files won't change much after compression. You may see a small decrease in file size.
Compressed files will not reduce their size after compression.
Five years later, I have at least some real statistics to back this up.
I've generated 17439 multi-page PDF files with PrinceXML, totalling 4858 MB. Running zip -r archive pdf_folder gives me an archive.zip that is 4542 MB. That's 93.5% of the original size, so it's not worth it to save space.
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
PDF files are already compressed. They use the following compression algorithms:
LZW (Lempel-Ziv-Welch)
FLATE (ZIP, in PDF 1.2)
JPEG and JPEG2000 (PDF version 1.5)
CCITT (the facsimile standard, Group 3 or 4)
JBIG2 compression (PDF version 1.4)
RLE (Run Length Encoding)
Depending on which tool created the PDF and its version, different types of compression are used. You can compress it further using a more efficient algorithm, or lose some quality by converting the images to low-quality JPEGs.
There is a great link on this here
http://www.verypdf.com/pdfinfoeditor/compression.htm
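If you just want a quick idea of which of those filters a given PDF uses, scanning the raw bytes for the filter names is often enough (a crude sketch - a real check would use a proper PDF parser, and filters hidden inside compressed object streams won't show up):

import re

FILTERS = [b"FlateDecode", b"DCTDecode", b"JPXDecode", b"LZWDecode",
           b"CCITTFaxDecode", b"JBIG2Decode", b"RunLengthDecode"]

raw = open("some_document.pdf", "rb").read()     # hypothetical file name
for name in FILTERS:
    hits = len(re.findall(rb"/" + name, raw))
    if hits:
        print(name.decode(), hits)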
Files encrypted with a good algorithm like IDEA or DES in CBC mode don't compress at all, regardless of their original content. That's why encryption programs first compress and only then run the encryption.
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
Media files don't tend to compress well. JPEG and MPEG won't compress further, though you may be able to squeeze a little out of .png files.
Files that are already compressed usually can't be compressed any further - for example mp3, jpg, flac, and so on.
You could even end up with files that are slightly bigger, because of the header added by the second round of compression.
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and the like that are already compressed, so they will not compress much further. Your algorithm is probably only able to eke out meagre savings, if any, from the text strings contained in the PDF.
Simple answer: compressed files (otherwise we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression, and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
You can add all Office 2007 file formats to the list (of #waqasahmed):
Since the Office 2007 .docx and .xlsx (etc.) files are actually zipped .xml files, you might not see a lot of size reduction in them either.
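You can see that for yourself with a couple of lines of Python (the file name is made up) - the parts inside a .docx are already deflate-compressed:

import zipfile

with zipfile.ZipFile("report.docx") as z:
    for info in z.infolist()[:5]:
        print(info.filename, info.compress_size, "/", info.file_size)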
Two kinds of data that will not compress:
Truly random data
An approximation thereof, made by a cryptographically strong hash function or cipher, e.g.:
AES-CBC(any input)
"".join(map(b2a_hex, [md5(str(i)) for i in range(...)]))
Any lossless compression algorithm, provided it makes some inputs smaller (as the name "compression" suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped, without collisions (because the compression must be lossless and reversible), into the much smaller set of all sequences of length less than L - a possibility the pigeonhole principle excludes.
So there are infinitely many files which do NOT shrink after compression and, moreover, a file does not need to have high entropy for that to happen :)
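The counting behind that pigeonhole argument, spelled out for inputs of exactly L bits (plain Python):

L = 16
exactly_L      = 2 ** L                          # bit strings of length exactly L
shorter_than_L = sum(2 ** n for n in range(L))   # lengths 0 .. L-1, i.e. 2**L - 1 of them
print(exactly_L, shorter_than_L)                 # 65536 65535

A lossless compressor that shortened every length-L input would need an injective map from 65536 inputs into only 65535 possible shorter outputs, which is impossible.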