Is there a way to merge rsync and tar (compress)? - compression

NOTE: I am using the term tar loosely here. I mean compress whether it be tar.gz, tar.bz2, zip, etc.
Is there a flag for rsync to negotiate the changed files between source/destination, tar the changed source files, send the single tar file to the destination machine and untar the changed files once arrived?
I have millions of files and remotely rsyncing across the internet to AWS seems very slow.
I know that rsync has a compression option (z), but it's my understanding that that compresses changed files on a per file basis. If there are many small files, the overhead of sending a 1KB as opposed to a 50KB file is still the bottleneck.
Also, simply tarring the whole directory is not efficient either as it will take an hour to archive

You can use the rsyncable option of gzip or pigz to compress the tar file to .gz format. (You will likely have to find a patch for gzip to add that. It's already part of pigz.)
The option partitions the resulting gzip file in a way that permits rsync to find only the modified portions for much more efficient transfers when only some of the files in the .tar.gz file have been changed.

I was looking for exact same thing as you and I landed on using borg.
tar cf - -C $DIR . | borg create $REPO::$NAME
tar will still read entire folder so you won't avoid a read penalty versus just rsyncing two dirs (since I believe rsync uses tricks to avoid reading each file for changes), but you will avoid the write penalty because borg will only write blocks it hasn't encountered before. Also borg auto compresses so no need for xz/gzip. Also, if borg is installed on both ends it won't send over superfluous data either since the two borgs can let each other know what they have versus don't.
If avoiding that read penalty is crucial for you, you could possibly use rsync to use its tricks to just tell you which files changed, create a difftar and send that to borg, but then getting borg to merge archives is whole second headache. You'd likely end up creating a filter that removes paths that were deleted from the original archive and then creating a new archive of just file additions/changes. And then you'd have to do that for each archive recursively. In the end it will create the original archive by extracting each version in sequence, but like I said a total headache.

Related

gsutil rsync between gzip/non-gzip local/cloud locations

For change detection, can gsutil's rsync use the gzip'd size for change detection?
Here's the situation:
Uploaded non-gzip'd static site content to a bucket using cp -Z so it's compressed at rest in the cloud.
Modify HTML files locally.
Need to rsync only the locally modified files.
So the upshot is that the content is compressed in the cloud and uncompressed locally. Can rsync be used to figure out what's changed?
From what I've tried, I'm thinking no because of the way rsync does it's change detection:
If -c is used, compare checksums but ONLY IF file sizes are the same.
Otherwise use times.
And it doesn't look like -J/-j impacts comparing the file size (the local uncompressed filesize is compared against the compressed cloud version which of course is FALSE) so -c won't kick in. Then, the times won't match and thus everything is uploaded again.
This seems like a fairly common use case. Is there a way of solving this?
Thank you,
Hans
To figure out how rsync identifies what has been changed while using gsutils please check Change Detection Algorithm.
I am unsure how do you want to compare between gzip non-gzip, but maybe gsutil compose could be used to make that middle step while compare between files before being compressed.
Take into account that in gsutils rsync's 4th limitation:
The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because Cloud Storage objects are immutable and no facility exists to read partial object checksums or perform partial replacements.

Caching for faster recompression of folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are two questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are several relevant commands:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, and then try using binary diff between two of them, and then compressing one of them and the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and send only one original.

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now, I want to develop a game and random accessing a file will be more efficient. If it is archived first, it will cause sequential access.
I know that there is an alternative format called zip or 7z, but what about other formats?
Not only gz.tar, I'd like to a minor compressing library and also get archiving features.
Should I use *.tar or other solutions are available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of content of the .tar file and find/store where all the files are in the archive, and then extract as you need. However, you may find that using your own format is "better" (for example, if I remember correctly, the "header" for a tar-archive is a file at a time, you may want to build your header in one lump, before you store the files [which does mean at least enumerating all the relevant files first, then forming the compressed variant and "patching up" the header with the offsets in compressed form]
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, both file formats use the same compression algorithm (by default), and the only reason that .tar.gz files are typically smaller is because they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, Quake engines have always (.pak, .pk3, .pk4) relied on an off-the-shelf zip format with a few minor additions (like a built-in checksum, I think).
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive you have the decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip, and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives, just start with libzip (not to be confused with zlib (for gzip or .gz)).
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-ballz" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the "gzip" compression (from zlib) has just been historically an industry-standard compression algorithm, although, now, there are better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Mostly, compressed archive formats often reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, BZIP, etc.), and often, formats like zip / 7z / rar are only really container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z. And their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.

Library for extracting zip on the fly

I have a rather large ZIP file, which gets downloaded (cannot change the file). The quest now is to unzip the file while it is downloading instead of having to wait till the central directory end is received.
Does such a library exist?
I wrote "pinch" a while back. It's in Objective-C but the method to decode files from a zip might be a way to get it in C++? Yeah, some coding will be necessary.
http://forrst.com/posts/Now_in_ObjC_Pinch_Retrieve_a_file_from_inside-I54
https://github.com/epatel/pinch-objc
I'm not sure such a library exists. Unless you are on a very fast line [or have a very slow processor], it's unlikely to save you a huge amount of time. Decompressing several gigabytes only takes a few seconds if all the data is in ram [it may then take a while to write the uncompressed data to the disk, and loading it from the disk may add to the total time].
However, assuming the sending end supports "range" downloading, you could possibly write something that downloads the directory first [by reading the fixed header first, then reading the directory and then downloading the rest of the file from start to finish]. Presumably that's how "pinch" linked in epatel's answer works.