Caching for faster recompression of a folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before sending them to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2) to make the compression incremental, even if it results in sub-optimal compression. For example, consider a naive dictionary-encoding compression algorithm: it should be possible to apply the encoding dictionary to a single changed file without re-processing all the files. I also imagine that the loss in compression ratio introduced by such a caching mechanism would grow as more files are added/removed/edited.
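As a rough illustration of that idea, here is a minimal sketch (hypothetical cache layout, per-file zlib compression rather than any standard archive format): each file's compressed blob is cached under its path, size and mtime, so re-packing after an edit only recompresses the files whose key changed.

import os, pickle, zlib

CACHE = "compress_cache.pickle"              # hypothetical cache file name

def pack(folder, out_path):
    try:
        with open(CACHE, "rb") as f:
            cache = pickle.load(f)           # {path: ((size, mtime_ns), compressed blob)}
    except FileNotFoundError:
        cache = {}
    blobs = {}
    for root, _, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            st = os.stat(path)
            key = (st.st_size, st.st_mtime_ns)
            entry = cache.get(path)
            if entry is None or entry[0] != key:
                with open(path, "rb") as f:
                    entry = (key, zlib.compress(f.read(), 6))   # recompress only this file
                cache[path] = entry
            blobs[os.path.relpath(path, folder)] = entry[1]
    with open(CACHE, "wb") as f:
        pickle.dump(cache, f)
    with open(out_path, "wb") as f:
        pickle.dump(blobs, f)                # naive container: a pickled dict of compressed blobs

Deleted files simply drop out of the output on the next run; the price, as guessed above, is that no redundancy across files is exploited, so the result is larger than a solid archive.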
Similar Questions
There are several questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?

Consulting the zip man page turns up several relevant options:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.

Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the command-line zip tool does exactly this (see the -u and -d excerpts above).
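For instance, here is a sketch using Python's zipfile (an illustration, not the only way to do it): append mode adds new entries without touching the existing ones, while deleting or replacing an entry means rewriting the archive, copying over only the entries you want to keep.

import os, zipfile

def add_files(archive, paths):
    # Append mode leaves existing entries untouched and writes new ones at the end.
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))

def delete_entries(archive, names_to_drop):
    # The stdlib cannot delete in place, so copy the entries to keep into a new file.
    # (Note: zipfile re-deflates the copied data; the command-line zip tool can copy
    # the raw compressed bytes instead, which is cheaper.)
    tmp = archive + ".tmp"
    with zipfile.ZipFile(archive) as zin, \
         zipfile.ZipFile(tmp, "w", compression=zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.filename not in names_to_drop:
                zout.writestr(info, zin.read(info.filename))
    os.replace(tmp, archive)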

Related

Compression container-format with arbitrary file operations

Is there a mature compression/container format that allows arbitrary file operations on the items inside, like delete/insert/update, without requiring full archive recreation?
I'm aware of Sqlar, based on the SQLite file format, which naturally supports this since the mentioned operations are just deleting/inserting/updating records containing blobs. But it is more of an experimental project created with other goals in mind and is not widely adopted.
UPDATE: to be more precise about what I have in mind, this is more like a file system inside the archive, where inserted files might occupy different "sectors" inside the container, depending on previous delete and update operations. But a file's "chain" of sectors is compressed as it is added, so it effectively occupies less space than the original file.
The .zip format. You may need to copy the zip file contents to do a delete, but you don't need to recreate the archive.
Update:
The .zip format can, in principle, support the deletion and addition of entries without copying the entire zip file, as well as the re-use of the space from deleted entries. The central directory at the end can be updated and cheaply rewritten. I have heard of it being done. You would have to deal with fragmentation, as with any file system. I am not aware of an open-source library that supports using a zip file as a file system. The .zip format does not support breaking an entry into sectors that could be scattered across the zip file, as file systems do. A single entry has to be contiguous in a zip file.
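To see the contiguous layout described above, you can dump each entry's local header offset and compressed payload size; this read-only sketch uses Python's zipfile (stuff.zip here is just the example archive from the man page excerpt):

import zipfile

with zipfile.ZipFile("stuff.zip") as zf:
    # Entries are laid out one after another; the central directory sits after the last one.
    for info in sorted(zf.infolist(), key=lambda i: i.header_offset):
        print(f"{info.header_offset:>12}  {info.compress_size:>12}  {info.filename}")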

How to serialize a diff of two folders optimally in C++

I'm trying to develop a file diff format for multiple files recursively in folders. Consider a source directory containing patched files and a destination directory containing the original files. The goal is to write a size-minimal diff file which expresses the difference between all files in the source and destination directories, and which can be applied to the original files in order to transform them into the patched files.
For this purpose I found the dtl library. Which algorithm or feature of the library should I use to write a file diff to disk which I can later read back and apply in order to patch the files? Any example code for this? I tried writing the result of the shortest edit script (SES) to disk, but I realized that I needed to specify the character and operation for every single byte. This makes the output file bigger than the file being compared, which makes the diff format entirely redundant, since storing the entire target file instead would use less storage.
As another reference, this is very similar to how version control systems like git or svn operate but I don't want to use those since I'm mainly dealing with binary files and the simple requirement of creating and applying patches.
After doing some more search, I found the HDiffPatch project.
It apparently works fine, but it seems to take a long time on bigger folder comparisons:
diff usage: hdiffz [options] oldPath newPath outDiffFile
patch usage: hpatchz [options] oldPath diffFile outNewPath
EDIT:
Another good option is open-vcdiff but it only supports individual files.
Use HDiffPatch: you can run hdiffz with "-s-48" to speed it up, or try "-s-32", "-s-1k", "-s-128k", etc.

Is there a way to merge rsync and tar (compress)?

NOTE: I am using the term tar loosely here. I mean compress whether it be tar.gz, tar.bz2, zip, etc.
Is there a flag for rsync to negotiate the changed files between source/destination, tar the changed source files, send the single tar file to the destination machine and untar the changed files once arrived?
I have millions of files and remotely rsyncing across the internet to AWS seems very slow.
I know that rsync has a compression option (z), but it's my understanding that it compresses changed files on a per-file basis. If there are many small files, the overhead of sending a 1KB file as opposed to a 50KB file is still the bottleneck.
Also, simply tarring the whole directory is not efficient either, as it will take an hour to archive.
You can use the rsyncable option of gzip or pigz to compress the tar file to .gz format. (You will likely have to find a patch for gzip to add that. It's already part of pigz.)
The option partitions the resulting gzip file in a way that permits rsync to find only the modified portions for much more efficient transfers when only some of the files in the .tar.gz file have been changed.
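A sketch of what that pipeline might look like (this assumes GNU tar and pigz are installed; driving them from Python is just one way to wire it up):

import subprocess

def make_rsyncable_targz(folder, out_path):
    # tar the folder and compress with pigz --rsyncable, so that rsync can match
    # unchanged regions of the resulting .tar.gz between runs.
    with open(out_path, "wb") as out:
        tar = subprocess.Popen(["tar", "-cf", "-", "-C", folder, "."],
                               stdout=subprocess.PIPE)
        subprocess.run(["pigz", "--rsyncable"], stdin=tar.stdout, stdout=out, check=True)
        tar.stdout.close()
        tar.wait()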
I was looking for the exact same thing as you, and I landed on using borg.
tar cf - -C $DIR . | borg create $REPO::$NAME
tar will still read the entire folder, so you won't avoid the read penalty versus just rsyncing two dirs (since I believe rsync uses tricks to avoid reading each file for changes), but you will avoid the write penalty, because borg will only write blocks it hasn't encountered before. Also, borg compresses automatically, so there is no need for xz/gzip. And if borg is installed on both ends, it won't send superfluous data either, since the two borg instances can let each other know what they have and what they don't.
If avoiding that read penalty is crucial for you, you could possibly use rsync's tricks just to tell you which files changed, create a diff tar of those, and send that to borg, but then getting borg to merge archives is a whole second headache. You'd likely end up creating a filter that removes paths that were deleted from the original archive and then creating a new archive of just the file additions/changes. And then you'd have to do that for each archive recursively. In the end you would recreate the original archive by extracting each version in sequence, but like I said, a total headache.

How to store a file once in a zip file instead of duplicating it in 50 folders

I have a directory structure that I need to write into a zip file, containing a single file that is duplicated in 50 subdirectories. When users download the zip file, the duplicated file needs to appear in every directory. Is there a way to store the file once in a zip file, yet have it placed into the subdirectories when it is extracted? I cannot use shortcuts.
It would seem like Zip would be smart enough to recognize that I have 50 duplicate files and automatically store the file once... It would be silly to make this file 50 times larger than necessary!
It is possible within the ZIP specification to have multiple entries in the central directory point to the same local header offset. The ZIP application would have to precalculate the CRC of the file it was going to add and find a matching entry in the central directory of the existing ZIP file. Looking up a CRC in a ZIP file that contains a huge number of entries would be an expensive operation, and precalculating the CRC on huge files would also be costly (CRC calculations are usually done during the compression routine).
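A rough sketch of that precalculation step (Python; grouping candidates by size plus CRC-32 is an assumption about how a deduplicating archiver might work, and colliding groups would still need a byte-for-byte comparison):

import os, zlib
from collections import defaultdict

def duplicate_candidates(folder):
    # The whole file must be read up front to get its CRC-32, which is the extra
    # cost mentioned above; normally the CRC falls out of the compression pass.
    groups = defaultdict(list)
    for root, _, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            crc = 0
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    crc = zlib.crc32(chunk, crc)
            groups[(os.path.getsize(path), crc)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]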
I have not heard of a specific ZIP application that makes this optimization. However, it does look like StuffIt X format supports duplicate file optimization:
The StuffIt X format supports "Duplicate Detection". When adding files to an archive, StuffIt detects if there are duplicate items (even if they have different file names), and only compresses the duplicates once, no matter how many copies there are. When expanded, StuffIt recreates all the duplicates from that one instance. Depending on the data being compressed, it can offer significant reductions in size and compression time.
I just wanted to clarify that the StuffIt solution only removes duplicate files when compressing to its own proprietary format, not to ZIP.

Can zip files be sparse/non-contiguous?

The zip file format ends with a central directory section that then points to the individual zip entries within the file. This appears to allow zip entries to occur anywhere within the zip file itself. Indeed, self-extracting zip files are a good example: they start with an executable and all the zip entries occur after the executable bytes.
The question is: does the zip file format really allow sparse or non-contiguous zip entries, e.g. if there are empty or otherwise unaccounted-for bytes between zip entries? Both the definitive PKWARE APPNOTE and the Wikipedia article seem to allow this. Will all/most typical zip utilities work with such sparse zip files?
The use case is this: I want to be able to delete or replace zip entries in a zip file. To do this, the typical minizip etc. libraries want you to copy out the entire zip file while not copying out the deleted or replaced zip entry, which seems wasteful and slow.
Wouldn't it be better to over-allocate, say 1.5x the storage for an entry, then when deleting or replacing an entry you could figure out where the unallocated bytes were and use those directly? Using 1.5x the storage means that if the zip entry grew linearly, the reallocations should also happen amortized linearly. It would be similar to file system block allocation though probably not as sophisticated.
This also helps with a lot of the zip-based file formats out there. Instead of having to have some temp directory somewhere (or even in memory) with the temporarily unzipped files for editing/changing and then having to rezip the lot back into the file format, this would lessen the need for rezipping and rewriting portions of the zip file.
Are there any C/C++ libraries out there that do this?
No. Reading the central directory is optional. zip decoders can, and some do, simply read the zip file sequentially from the beginning, expecting to see the local headers and entry data contiguously. They can complete the job of decoding, never having even looked at the central directory.
In order to do what you want, you would need to put in dummy zip entries between the useful entries in order to hold that space. At least if you want to be compatible with the rest of the zip world.
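A sketch of the dummy-entry idea (Python's zipfile; the filler name and size are arbitrary): space is reserved by writing a stored, uncompressed entry, so sequential decoders still see a valid contiguous stream, and an in-place editor could later reuse the region occupied by the filler.

import zipfile

def add_padding(archive, size):
    # A stored (uncompressed) filler entry occupies exactly `size` bytes plus headers.
    info = zipfile.ZipInfo("__padding__")        # hypothetical filler entry name
    info.compress_type = zipfile.ZIP_STORED
    with zipfile.ZipFile(archive, "a") as zf:
        zf.writestr(info, b"\0" * size)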