Compressing a folder with many duplicated files [closed] - compression

I have a pretty big folder (~10GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but in different sub-directories.
How can I compress the folder to make it small enough?
I tried to use WinRAR in "Best" mode, but it didn't compress it at all (pretty strange).
Will zip, tar, cab, 7z, or any other compression tool do a better job?
I don't mind letting the tool work for a few hours - but not more.
I'd rather not do it programmatically myself.

The best option in your case is 7-Zip.
Here are the options:
7za a -r -t7z -m0=lzma2 -mx=9 -myx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files
a - add files to archive
-r - Recurse subdirectories
-t7z - Set type of archive (7z in your case)
-m0=lzma2 - Sets the compression method to LZMA2. LZMA is the default and general compression method of the 7z format. The main features of the LZMA method:
High compression ratio
Variable dictionary size (up to 4 GB)
Compressing speed: about 1 MB/s on 2 GHz CPU
Decompressing speed: about 10-20 MB/s on 2 GHz CPU
Small memory requirements for decompressing (depends on dictionary size)
Small code size for decompressing: about 5 KB
Supporting multi-threading and P4's hyper-threading
-mx=9 - Sets the level of compression. -mx=0 means copy mode (no compression); -mx=9 is Ultra.
-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.
-md=29 - Sets the dictionary size for LZMA. You can specify the size in bytes, kilobytes, or megabytes. The maximum value for the dictionary size is 1536 MB, but the 32-bit version of 7-Zip allows specifying only up to a 128 MB dictionary. Default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size is calculated as DictionarySize = 2^Size bytes, so -md=29 means a 2^29-byte = 512 MB dictionary. For decompressing a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
I use -md=29 because my server has only 16 GB of RAM available. With these settings, 7-Zip takes only about 5 GB regardless of the size of the directory being archived. If I use a bigger dictionary size, the system goes into swap.
-ms=8g - Enables or disables solid mode. The default mode is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block size as big as possible.
Limitation of the solid block size usually decreases compression ratio. The updating of solid .7z archives can be slow, since it can require some recompression.
-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-Zip thread in one solid block. The drawback is slower archiving, no matter how many CPUs or cores your system has.
-mmtf=off - Sets multithreading mode for filters to OFF.
-myx=9 - Sets level of file analysis to maximum, analysis of all files (Delta and executable filters).
-mqs=on - Sort files by type in solid archives. To store identical files together.
-bt - show execution time statistics
-bb3 - set output log level
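For completeness, a hedged extraction sketch (paths are placeholders): remember that decompressing needs roughly the dictionary size in RAM, so with -md=29 plan for about 512 MB free.
7za x archive_file_name.7z -o/path/to/restore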

7-Zip supports the WIM file format, which detects duplicate files and stores their content only once. If you're using the 7-Zip GUI, simply select the 'wim' file format.
If you're using command-line 7-Zip, see this answer:
https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
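For example, a minimal command-line sketch (archive and folder names are placeholders; the -t switch selects the archive type):
7z a -twim archive.wim /path/to/folder
The resulting .wim stores each duplicate file body only once; if you want, you can then compress that single file further with a normal 7z/LZMA2 pass.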

I suggest 3 options that I've tried (in Windows):
7zip LZMA2 compression with a dictionary size of 1536 MB
WinRAR "solid" file
7zip WIM file
I had 10 folders with different versions of a web site (with files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1 GB (100 MB average per folder). While standard 7zip or WinRAR compression gave me a file of about 400-500 MB, these options gave me a file of (1) 80 MB, (2) 100 MB and (3) 170 MB respectively.
Update edit: Thanks to @Griffin's suggestion in the comments, I tried applying 7zip LZMA2 compression (dictionary size seems to make no difference) on top of the 7zip WIM file. Sadly it is not the same backup file I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try with your specific set of files and compare it against method 1.
New edit: My backups kept growing and now include many image files. With 30 versions of the site, method 1 weighs 6 GB, while a 7zip WIM file inside a 7zip LZMA2 file weighs only 2 GB!
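A hedged sketch of the two-step method above (archive and folder names are placeholders): first gather everything into a WIM so duplicates are stored once, then compress that single file with LZMA2:
7z a -twim sites.wim /path/to/site_versions
7z a -t7z -m0=lzma2 -mx=9 sites.7z sites.wim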

Do the duplicated files have the same names? Are they usually less than 64 MB in size? Then you should sort by file name (without the path), use tar to archive all of the files in that order into a .tar file, and then use xz to compress to make a .tar.xz compressed archive. Duplicated files that are adjacent in the .tar file and are less than the window size for the xz compression level being used should compress to almost nothing. You can see the dictionary sizes, "DictSize" for the compression levels in this xz man page. They range from 256 KB to 64 MB.
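A hedged sketch of that approach with GNU find, tar and xz (it assumes GNU tools and breaks on file names containing tabs or newlines):
cd /path/to/folder
# list files as "basename<TAB>path", sort by basename so duplicates become adjacent, then keep only the path column
find . -type f -printf '%f\t%p\n' | sort | cut -f2- | tar -cf - --no-recursion --files-from=- | xz -9 -c > ../archive.tar.xz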

By default, WinRAR compresses each file separately, so there is no real gain when compressing a folder structure with many similar or even identical files.
But there is also the option to create a solid archive. Open the help of WinRAR, open the item Archive types and parameters on the Contents tab, and click on Solid archives. This help page explains what a solid archive is and which advantages and disadvantages this archive format has.
A solid archive with a larger dictionary size in combination with best compression can make an archive file with a list of similar files very small. For example, I have a list of 327 binary files with file sizes from 22 KB to 453 KB, totalling 47 MB (not including the cluster size of the partition). I can compress those 327 similar, but not identical, files into a RAR archive of only 193 KB using a dictionary size of 4 MB. That is of course a dramatic reduction in size.
Follow the link to the help page about rarfiles.lst after reading the help page about solid archives. It describes how you can control the order in which files are put into a solid archive. This file is located in the program files folder of WinRAR and can of course be customized to your needs.
You also have to take care of the option Files to store without compression when using the GUI version of WinRAR. This option can be found after clicking on the symbol/command Add on the tab Files. It specifies file types which are just stored in the archive without any compression, like *.png, *.jpg, *.zip, *.rar, ... Those files usually already contain data in compressed format, so it usually does not make much sense to compress them once again. But if duplicate *.jpg files exist in a folder structure and a solid archive is created, it makes sense to remove all file extensions from this option.
A suitable command line with using the console version Rar.exe of WinRAR and with using RAR5 archive file format would be:
"%ProgramFiles%\WinRAR\Rar.exe a -# -cfg- -ep1 -idq -m5 -ma5 -md128 -mt1 -r -s -tl -y -- "%UserProfile%\ArchiveFileName.rar" "%UserProfile%\FolderToArchive\"
The switches used in this example are explained in the manual of Rar.exe, which is the text file Rar.txt in the program files directory of WinRAR. WinRAR.exe can also be used by replacing the switch -idq with -ibck, as explained in the help of WinRAR on the page Alphabetic switches list (menu Help > Help topics > tab Contents > Command line mode > Switches > Alphabetic switches list).
By the way: there are applications like Total Commander, UltraFinder or UltraCompare, and many others, which support searching for duplicate files by various user-configurable criteria, like finding files with the same name and same size or, most reliably, finding files with the same size and same content, and which provide functions to delete the duplicates.

Try eXdupe from www.exdupe.com; it uses deduplication and is so fast that it's practically disk I/O bound.

Related

Compressing a collection of ISO files with similar content

I have a large collection of ISO files (around 1GB each) that share 'runs of data' between them. So, for example, one of the audio tracks may be the same (same length and content across 5 ISOs), but it may not necessarily have the same name or location in each.
Is there some compression technique I can apply that will detect and losslessly deduplicate this information across multiple files?
For anyone reading this, after some experimentation it turns out that by putting all the similar ISO or CHD files in a single 7zip archive (Solid archive, with maximum dictionary size of 1536MB), I was able to achieve extremely high compression via deduplication on already compressed data.
The lrzip program is designed for this kind of thing. It is available in most Linux/BSD package managers, or via Cygwin for Windows.
It uses an extended version of rzip to first de-duplicate the source files, and then compresses them. Because it uses mmap() it does not have issues with the size of your RAM, like 7zip does.
In my tests lrzip was able to massively de-duplicate similar ISOs, bringing a 32GB set of OS installation discs down to around 5GB.
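A hedged usage sketch, assuming the lrztar/lrzuntar wrapper scripts shipped with lrzip (the directory name is a placeholder):
lrztar iso_collection/
lrzuntar iso_collection.tar.lrz
The first command tars the directory and de-duplicates/compresses it into iso_collection.tar.lrz; the second extracts it again.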

Why Zipalign cannot work properly with .pvr files?

I am using .pvr files in my Android game. But when compressing with zipalign, the size of the .pvr files does not change (other file types worked well).
I tried using the newest zipalign tool and changing flags:
tools/windows/zipalign -v -f 4 C:\_Working\Game.apk release_apk\Game.apk
The zipalign tool is not about compressing but about "aligning" elements in the zip file, which means moving them to a position in the zip file that is a multiple of the value you give (in this case 4, meaning every uncompressed element starts at an offset that is a multiple of 4 bytes). Compression is completely orthogonal to zip-aligning.
Depending on what tool you use to build your APK, some build systems may keep some files uncompressed, so you should look at the documentation.
Another possibility is that the .pvr file is already compressed in itself so zipping it brings little gain in size.

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, then try a binary diff between two of them, and then compress one of them plus the resulting diff. You could get fancy and try diffing all combinations, pick the smallest ones to compress, and send only one original.
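For instance, a hedged sketch with xdelta3 (one such binary diff tool; file names are placeholders): keep one reference file plus a small delta per near-identical sibling, then compress the lot:
xdelta3 -e -s reference.h5 other1.h5 other1.vcdiff
xdelta3 -d -s reference.h5 other1.vcdiff other1.h5
The first line encodes the differences between the reference and a sibling; the second rebuilds the sibling from the reference plus its delta.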

Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++ but when I want to get data out of the file I need to extract the whole thing. Does anyone know a compression component (preferably .NET compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original File       Compressed File
10-27          =>   2-5
100-202        =>   10-19
..............
10230-102020   =>   217-298
Since I know the data I need only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?
I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem with a project I am working on (at least I think so), where I had to keep many text articles on the drive and access them in a fairly random manner, and because of the size of the data I had to compress them.
The problem with compressing all this data at once is that most algorithms depend on previous data when decompressing. For example, the popular LZW method creates a dictionary (instructions on how to decompress the data) on the fly while doing the decompression, so decompressing a stream from the middle is not possible, although I believe those methods might be tuned for it.
The solution I have found to work best, although it does decrease your compression ratio, is to pack data in chunks. In my project it was simple: each article was one chunk, I compressed them one by one, and then created an index file that kept track of where each "chunk" starts. Decompressing was easy in that case: just decompress the whole stream, which was the one article I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such an elegant manner, you can always try to split it into chunks artificially, for example by packing data in 5 MB chunks. Then, when you need to read data from 7 MB to 13 MB, you just decompress the chunks covering 5-10 MB and 10-15 MB.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives a slightly worse compression ratio. The smaller the chunks are, the worse the compression will be.
Also: having many chunks of data doesn't mean you have to have many different files on the hard drive; just pack them one after another in one file and remember where they start.
Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files that you can use for compression, and it is written in C#. It worked pretty nicely for me, and you can use its built-in functionality of creating many files in one zip file to take care of splitting the data into chunks.
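A hedged command-line illustration of the same chunking idea (not a .NET solution; file names and the 5 MB chunk size are arbitrary): because each chunk is compressed independently, any byte range can be served by decompressing only the chunks that cover it:
split -b 5M -d original.dat chunk_
gzip -9 chunk_*
gunzip -c chunk_01.gz chunk_02.gz > range_5MB_to_15MB.dat
Here bytes 7-13 MB fall inside chunks 01 (5-10 MB) and 02 (10-15 MB), so only those two chunks are decompressed.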

Which files does not reduce its size after compression [closed]

I have written a Java program for compression. I have compressed some text files and their size was reduced after compression. But when I tried to compress a PDF file, I did not see any change in file size after compression.
So I want to know which other files will not reduce in size after compression.
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
jpeg/gif/avi/mpeg/mp3 and already compressed files won't change much after compression. You may see a small decrease in file size.
Compressed files will not reduce their size after compression.
Five years later, I have at least some real statistics to show for this.
I've generated 17439 multi-page PDF files with PrinceXML that total 4858 MB. A zip -r archive pdf_folder gives me an archive.zip that is 4542 MB. That's 93.5% of the original size, so it's not worth it to save space.
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
PDF files are already compressed. They use the following compression algorithms:
LZW (Lempel-Ziv-Welch)
FLATE (ZIP, in PDF 1.2)
JPEG and JPEG2000 (PDF version 1.5)
CCITT (the facsimile standard, Group 3 or 4)
JBIG2 compression (PDF version 1.4)
RLE (Run Length Encoding)
Depending on which tool created the PDF and its version, different types of compression are used. You can compress it further using a more efficient algorithm, or lose some quality by converting images to low-quality JPEGs.
There is a great link on this here
http://www.verypdf.com/pdfinfoeditor/compression.htm
Files encrypted with a good algorithm like IDEA or DES in CBC mode don't compress anymore regardless of their original content. That's why encryption programs first compress and only then run the encryption.
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
Media files don't tend to compress well. JPEG and MPEG don't compress, while you may be able to compress .png files.
Files that are already compressed usually can't be compressed any further. For example mp3, jpg, flac, and so on.
You could even get files that are bigger because of the re-compressed file header.
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and such that are already compressed, so they will not compress much further. Your algorithm is probably only able to eke out meagre savings, if any, based on the text strings contained in the PDF.
Simple answer: compressed files (otherwise we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression, and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
You can add all Office 2007 file formats to the list (of @waqasahmed):
Since the Office 2007 .docx and .xlsx (etc.) are actually zipped .xml files, you might not see a lot of size reduction in them either.
Files that will not shrink are essentially those whose content is:
Truly random
An approximation thereof, made by a cryptographically strong hash function or cipher, e.g.:
AES-CBC(any input)
"".join(map(b2a_hex, [md5(str(i)) for i in range(...)]))
Any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression must be lossless and reversible), a possibility the pigeonhole principle excludes.
So, there are infinitely many files which do NOT reduce in size after compression and, moreover, a file does not need to be a high-entropy file for that to happen :)
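A quick command-line check of both points, assuming a Unix-like system with gzip: already-random data gains nothing (and even grows slightly because of the added header), while highly redundant data collapses:
head -c 1000000 /dev/urandom > random.bin
head -c 1000000 /dev/zero > zeros.bin
gzip -9 -k random.bin zeros.bin
ls -l random.bin.gz zeros.bin.gz
Typically random.bin.gz comes out slightly larger than 1 MB, while zeros.bin.gz shrinks to about 1 KB.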