Why doesn't Zipalign work properly with .pvr files?

I am using .pvr files in my Android game, but when I compress the APK with Zipalign, the size of the .pvr files does not change (other file types work fine).
I tried the newest Zipalign tool and different flags:
tools/windows/zipalign -v -f 4 C:_Working\Game.apk release_apk\Game.apk

The zipalign tool does not compress anything; it "aligns" entries in the zip file, that is, it moves them so that each one starts at an offset that is a multiple of the value you give (in this case 4, which means every uncompressed entry starts at an offset that is a multiple of 4 bytes). Compression is completely orthogonal to zip-aligning.
Depending on the tool you use to build your APK, the build system may store some file types uncompressed, so check its documentation.
Another possibility is that the .pvr file is already compressed internally, so zipping it gains little in size.
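The last point can be checked with a small experiment. The sketch below is only an illustration: random bytes stand in for an already-compressed texture (an assumption, since real .pvr contents depend on the pixel format), and the entry name asset.bin is made up.

```python
import io
import os
import zipfile

def zipped_size(payload: bytes, method: int) -> int:
    """Return the size one entry occupies inside a zip written with `method`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("asset.bin", payload)  # hypothetical entry name
    with zipfile.ZipFile(buf) as zf:
        return zf.getinfo("asset.bin").compress_size

# Random bytes behave like data that has already been compressed.
already_compressed = os.urandom(256 * 1024)
# Highly repetitive data of the same length, for comparison.
plain_text = b"level=1;score=0;" * (16 * 1024)

for name, payload in [("random", already_compressed), ("text", plain_text)]:
    stored = zipped_size(payload, zipfile.ZIP_STORED)
    deflated = zipped_size(payload, zipfile.ZIP_DEFLATED)
    print(f"{name}: stored={stored} bytes, deflated={deflated} bytes")
```

The random payload should come out at roughly its original size even with deflate, while the repetitive one shrinks to a small fraction. Neither result depends on how the archive is aligned.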


How to serialize a diff of two folders optimally in C++

I'm trying to develop a file diff format for multiple files recursively in folders. Consider a source directory containing patched files and a destination directory containing original files. Write a size minimal diff file which expresses the difference between all files in the source and destination directory which can be applied to the original files in order to transform the original files into the patched files.
For this purpose I found the dtl library. Which algorithm or feature of the library should I use to write a file diff to the disk which I can then later read back and apply in order to patch the file? Any example code for this? I tried writing the result of the shortest edit script (SES) to the disk but I realized that I needed to specify the character and operation for every single byte. This of course makes the output file bigger than the entire comparison file, making this diff format entirely redundant since storing the entire target file instead would've saved more storage.
As another reference, this is very similar to how version control systems like git or svn operate but I don't want to use those since I'm mainly dealing with binary files and the simple requirement of creating and applying patches.
After some more searching, I found the HDiffPatch project.
It apparently works fine, but it seems to take a long time on larger folder comparisons:
diff usage: hdiffz [options] oldPath newPath outDiffFile
patch usage: hpatchz [options] oldPath diffFile outNewPath
Another good option is open-vcdiff but it only supports individual files.
Use HDiffPatch: you can run hdiffz with "-s-48" to speed it up,
or try "-s-32", "-s-1k", "-s-128k" ...

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.
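To make the idea concrete, here is a deliberately naive sketch of a block-level binary diff. It is not how any particular diff tool works: it only recognises unchanged blocks that sit at block-aligned offsets, and the block size and the function names make_delta and apply_delta are made up for the example. Real binary diff programs also cope with data that has shifted position.

```python
import hashlib
import os

BLOCK = 4096  # hypothetical block size chosen for the example

def make_delta(base: bytes, target: bytes, block: int = BLOCK):
    """Encode target as ('ref', block_index_in_base) or ('lit', raw_bytes)."""
    index = {}
    for i in range(0, len(base), block):
        digest = hashlib.sha256(base[i:i + block]).digest()
        index.setdefault(digest, i // block)
    delta = []
    for j in range(0, len(target), block):
        chunk = target[j:j + block]
        hit = index.get(hashlib.sha256(chunk).digest())
        delta.append(("ref", hit) if hit is not None else ("lit", chunk))
    return delta

def apply_delta(base: bytes, delta, block: int = BLOCK) -> bytes:
    """Rebuild the target from the base file and the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "ref":
            out += base[value * block:(value + 1) * block]
        else:
            out += value
    return bytes(out)

# Demo: two 16 KiB blobs that differ in a single byte.
base = os.urandom(4 * BLOCK)
changed = bytearray(base)
changed[5000] ^= 0xFF  # flip one byte inside the second block
target = bytes(changed)

delta = make_delta(base, target)
literal_bytes = sum(len(v) for kind, v in delta if kind == "lit")
print(f"literal bytes in delta: {literal_bytes} of {len(target)}")
```

Only the block containing the changed byte has to be stored literally; the rest of the target is described by references into the base file. You would then compress one original plus the deltas.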

Is there a way to merge rsync and tar (compress)?

NOTE: I am using the term tar loosely here. I mean compress whether it be tar.gz, tar.bz2, zip, etc.
Is there a flag for rsync to negotiate the changed files between source/destination, tar the changed source files, send the single tar file to the destination machine and untar the changed files once arrived?
I have millions of files and remotely rsyncing across the internet to AWS seems very slow.
I know that rsync has a compression option (z), but it's my understanding that it compresses changed files on a per-file basis. If there are many small files, the overhead of sending a 1KB file as opposed to a 50KB file is still the bottleneck.
Also, simply tarring the whole directory is not efficient either, as it will take an hour to archive.
You can use the rsyncable option of gzip or pigz to compress the tar file to .gz format. (You will likely have to find a patch for gzip to add that. It's already part of pigz.)
The option partitions the resulting gzip file in a way that permits rsync to find only the modified portions for much more efficient transfers when only some of the files in the .tar.gz file have been changed.
I was looking for exact same thing as you and I landed on using borg.
tar cf - -C $DIR . | borg create $REPO::$NAME
tar will still read the entire folder, so you won't avoid a read penalty versus just rsyncing two dirs (since I believe rsync uses tricks to avoid reading each file for changes), but you will avoid the write penalty because borg will only write blocks it hasn't encountered before. Also, borg auto-compresses, so there is no need for xz/gzip. And if borg is installed on both ends, it won't send over superfluous data either, since the two borgs can let each other know what they have and what they don't.
If avoiding that read penalty is crucial for you, you could possibly use rsync's tricks just to tell you which files changed, create a difftar and send that to borg, but then getting borg to merge archives is a whole second headache. You'd likely end up creating a filter that removes paths that were deleted from the original archive and then creating a new archive of just the file additions/changes. And then you'd have to do that for each archive recursively. In the end you would recreate the original archive by extracting each version in sequence, but like I said, a total headache.

Compressing a folder with many duplicated files [closed]

I have a pretty big folder (~10GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.
How can I compress the folder to make it small enough?
I tried to use WinRAR in "Best" mode, but it didn't compress it at all. (Pretty strange.)
Will zip\tar\cab\7z\ or any other compression tool do a better job?
I don't mind letting the tool work for a few hours, but not more.
I'd rather not do it programmatically myself.
The best option in your case is 7-zip.
Here are the options:
7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files
a - add files to archive
-r - Recurse subdirectories
-t7z - Set type of archive (7z in your case)
-m0=lzma2 - Set compression method to LZMA2. LZMA is default and general compression method of 7z format. The main features of LZMA method:
High compression ratio
Variable dictionary size (up to 4 GB)
Compressing speed: about 1 MB/s on 2 GHz CPU
Decompressing speed: about 10-20 MB/s on 2 GHz CPU
Small memory requirements for decompressing (depends on dictionary size)
Small code size for decompressing: about 5 KB
Supporting multi-threading and P4's hyper-threading
-mx=9 - Sets level of compression. x=0 means Copy mode (no compression). x=9 - Ultra
-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.
-md=29 - Sets Dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum value for dictionary size is 1536 MB, but the 32-bit version of 7-Zip allows a dictionary of up to 128 MB. Default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size will be calculated as DictionarySize = 2^Size bytes. For decompressing a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
I use md=29 because my server has only 16 GB of RAM available. With these settings 7-zip takes only 5 GB regardless of the size of the directory being archived. If I use a bigger dictionary size, the system goes to swap.
-ms=8g - Enables or disables solid mode. The default mode is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case this is very important to make solid block size as big as possible.
Limitation of the solid block size usually decreases compression ratio. The updating of solid .7z archives can be slow, since it can require some recompression.
-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-zip thread in one solid block. The drawback is slow archiving, no matter how many CPUs or cores your system has.
-mmtf=off - Set multithreading mode for filters to OFF.
-myx=9 - Sets level of file analysis to maximum, analysis of all files (Delta and executable filters).
-mqs=on - Sorts files by type in solid archives, so that identical files are stored together.
-bt - show execution time statistics
-bb3 - set output log level
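The -md rule quoted above (DictionarySize = 2^Size bytes when no unit suffix is given) is easy to check with a little arithmetic. The helper name dict_size_mib is made up for this sketch; it only evaluates the formula and says nothing about how much memory 7-zip uses in total.

```python
def dict_size_mib(md_exponent: int) -> int:
    """Dictionary size in MiB for -md=<exponent> given without a unit suffix."""
    return (2 ** md_exponent) // (1024 * 1024)

# The three defaults mentioned above, plus the value used in the command line.
for exponent in (24, 25, 26, 29):
    print(f"-md={exponent}: {dict_size_mib(exponent)} MiB")
```

So -md=29 asks for a 512 MiB dictionary, eight times the 64 MB ultra-mode default.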
7-zip supports the 'WIM' file format which will detect and 'compress' duplicates. If you're using the 7-zip GUI then you simply select the 'wim' file format.
Only if you're using command line 7-zip, see this answer.
I suggest 3 options that I've tried (in Windows):
7zip LZMA2 compression with dictionary size of 1536Mb
WinRar "solid" file
7zip WIM file
I had 10 folders with different versions of a web site (with files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1Gb (100Mb average per folder). While standard 7zip or WinRar compression gave me a file of about 400/500Mb, these options gave me a file of (1) 80Mb, (2) 100Mb & (3) 170Mb respectively.
Update edit: Thanks to @Griffin's suggestion in the comments, I tried 7zip LZMA2 compression (dictionary size seems to make no difference) on top of the 7zip WIM file. Sadly it is not the same backup file I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try on your specific set of files and compare it against method 1.
New edit: My backups were growing and now contain many image files. With 30 versions of the site, method 1 weighs 6Gb, while a 7zip WIM file inside a 7zip LZMA2 file weighs only 2Gb!
Do the duplicated files have the same names? Are they usually less than 64 MB in size? Then you should sort by file name (without the path), use tar to archive all of the files in that order into a .tar file, and then use xz to compress to make a .tar.xz compressed archive. Duplicated files that are adjacent in the .tar file and are less than the window size for the xz compression level being used should compress to almost nothing. You can see the dictionary sizes, "DictSize" for the compression levels in this xz man page. They range from 256 KB to 64 MB.
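A minimal sketch of that approach in Python, in case sorting the file list by hand is inconvenient. The function name tar_xz_sorted_by_name and the default preset of 6 are choices made for this example; a higher preset gives a larger dictionary at the cost of more memory. Write the archive outside the folder being archived.

```python
import os
import tarfile
import tempfile

def tar_xz_sorted_by_name(root: str, archive_path: str, preset: int = 6) -> None:
    """Archive every file under root into a .tar.xz, ordered by base name."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Sort on the file name so same-named duplicates become neighbours;
    # the full path is only a tie-breaker that keeps the order deterministic.
    paths.sort(key=lambda p: (os.path.basename(p), p))
    with tarfile.open(archive_path, "w:xz", preset=preset) as tar:
        for path in paths:
            tar.add(path, arcname=os.path.relpath(path, root), recursive=False)

# Demo: five identical 100 KB files scattered across sub-directories.
with tempfile.TemporaryDirectory() as work:
    root = os.path.join(work, "tree")
    payload = os.urandom(100_000)  # incompressible on its own
    for i in range(5):
        sub = os.path.join(root, f"dir{i}")
        os.makedirs(sub)
        with open(os.path.join(sub, "asset.bin"), "wb") as f:
            f.write(payload)
    archive = os.path.join(work, "tree.tar.xz")
    tar_xz_sorted_by_name(root, archive)
    archive_size = os.path.getsize(archive)
    with tarfile.open(archive) as tar:
        member_count = len(tar.getnames())
    print(f"{5 * len(payload)} bytes of input -> {archive_size} bytes archived")
```

In the demo the archive should end up only slightly larger than one copy of the file, because the other four copies fall inside the compression window.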
By default, WinRAR compresses each file separately, so there is no real gain from compressing a folder structure with many similar or even identical files.
But there is also the option to create a solid archive. Open the WinRAR help, go to the item Archive types and parameters on the Contents tab, and click on Solid archives. This help page explains what a solid archive is and which advantages and disadvantages this archive format has.
A solid archive with a larger dictionary size in combination with best compression can make an archive of similar files very small. For example, I have 327 binary files with sizes from 22 KB to 453 KB, 47 MB in total (not counting the cluster size of the partition). I can compress those 327 similar, but not identical, files into a RAR archive of only 193 KB with a dictionary size of 4 MB. That is of course a dramatic reduction in size.
After reading the help page about solid archives, follow the link to the help page about rarfiles.lst. It describes how you can control the order in which files are put into a solid archive. This file is located in the WinRAR program files folder and can of course be customized to your needs.
If you use the GUI version of WinRAR, also pay attention to the option Files to store without compression. It can be found on the tab Files after clicking on the Add symbol/command. It lists file types that are just stored in the archive without any compression, such as *.png, *.jpg, *.zip, *.rar, ... Those files usually already contain data in compressed format, so it does not make much sense to compress them again. But if duplicate *.jpg files exist in a folder structure and a solid archive is created, it makes sense to remove all file extensions from this option.
A suitable command line using the console version Rar.exe of WinRAR and the RAR5 archive format would be:
"%ProgramFiles%\WinRAR\Rar.exe" a -@ -cfg- -ep1 -idq -m5 -ma5 -md128 -mt1 -r -s -tl -y -- "%UserProfile%\ArchiveFileName.rar" "%UserProfile%\FolderToArchive\"
The switches used in this example are explained in the manual of Rar.exe, which is the text file Rar.txt in the WinRAR program files directory. WinRAR.exe can also be used if the switch -idq is replaced by -ibck, as explained in the WinRAR help on the page Alphabetic switches list. To reach that page, open the Help menu, click on Help topics, and on the Contents tab expand Command line mode, then Switches, and click on Alphabetic switches list.
By the way: There are applications like Total Commander, UltraFinder or UltraCompare and many others which support searching for duplicate files by various, user configurable criteria like finding files with same name and same size, or most secure, finding files with same size and same content, and providing functions to delete the duplicates.
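For anyone who does want to script it, the "same size and same content" check can be sketched in a few lines of Python. This is not how the applications named above work internally; the function names find_duplicates and file_digest are made up for the example, and content is compared through SHA-256 digests rather than byte by byte.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str) -> list[list[str]]:
    """Return groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups

# Demo: two identical files and one different file of the same size.
with tempfile.TemporaryDirectory() as root:
    samples = [("a", "x.bin", b"same"), ("b", "y.bin", b"same"), ("c", "z.bin", b"diff")]
    for folder, name, data in samples:
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, name), "wb") as f:
            f.write(data)
    groups = [[os.path.relpath(p, root) for p in g] for g in find_duplicates(root)]
    print(groups)
```

Grouping by size first means only files that could possibly match are ever read and hashed.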
Try eXdupe from www.exdupe.com, it uses deduplication and is so fast that it's practically disk I/O bound

Create .7z archive with custom header (Creating a .unity3d file)

So, I recently came across the .unity3d file for a game I used to play, and unpacked it using a tool (http://en.unity3d.netobf.com/). Now I've made the tweaks to the game that I needed to make it run on a local server, and have come across the issue of how to compress the files back into a .unity3d file. I've reverse engineered the tool and determined that .unity3d files are LZMA compressed (just like a .7z archive), but the header is "UnityWeb" instead of "7z". How might I achieve this?
7z is open source. If the only difference is indeed that header, then get the sources, find where the header is, change it and compile your own compression utility. Watch out for other constants describing the headers and signatures though (e.g. length of the signature). I'd suggest starting with line 9 of the file Xz.c (defining XZ_SIG and XZ_FOOTER_SIG).