gsutil rsync between gzip/non-gzip local/cloud locations - google-cloud-platform

For change detection, can gsutil's rsync use the gzip'd size for change detection?
Here's the situation:
Uploaded non-gzip'd static site content to a bucket using cp -Z so it's compressed at rest in the cloud.
Modify HTML files locally.
Need to rsync only the locally modified files.
So the upshot is that the content is compressed in the cloud and uncompressed locally. Can rsync be used to figure out what's changed?
From what I've tried, I'm thinking no because of the way rsync does it's change detection:
If -c is used, compare checksums but ONLY IF file sizes are the same.
Otherwise use times.
And it doesn't look like -J/-j impacts comparing the file size (the local uncompressed filesize is compared against the compressed cloud version which of course is FALSE) so -c won't kick in. Then, the times won't match and thus everything is uploaded again.
This seems like a fairly common use case. Is there a way of solving this?
Thank you,
Hans

To figure out how rsync identifies what has been changed while using gsutils please check Change Detection Algorithm.
I am unsure how do you want to compare between gzip non-gzip, but maybe gsutil compose could be used to make that middle step while compare between files before being compressed.
Take into account that in gsutils rsync's 4th limitation:
The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because Cloud Storage objects are immutable and no facility exists to read partial object checksums or perform partial replacements.

Related

Caching for faster recompression of folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are two questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are several relevant commands:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.

one line edits to files in AWS S3

I have many very large files (> 6 GB) stored in an AWS S3 bucket that need very minor edits done to them.
I can edit these files by pulling them to a server, using sed or perl to edit the key word, and then pushing them back, but this is very time-consuming, especially for a one-word edit to a 6 or 7 GB text file.
I use a program that makes the AWS S3 like a random-access file system, https://github.com/s3fs-fuse/s3fs-fuse, but this is unusuably slow, so it isn't an option.
How can I edit these files, or use sed, via a script without the expensive and slow step of pulling from and pushing back to S3?
You can't.
The library you use certainly does it right: download the existing file, do the edit locally, then push back the results. It's always going to be slow.
With sed, it may be possible to make it faster, assuming your existing library does it in three separate steps. But you can't send the result right back and overwrite the file before you're done reading it (at least I would suggest not doing so.)
If this is a one time process, then the slowness should not be an issue. If that's something you are likely to perform all the time, then I'd suggest you use a different type of storage. This one may not be appropriate for your app.

Is there a way to merge rsync and tar (compress)?

NOTE: I am using the term tar loosely here. I mean compress whether it be tar.gz, tar.bz2, zip, etc.
Is there a flag for rsync to negotiate the changed files between source/destination, tar the changed source files, send the single tar file to the destination machine and untar the changed files once arrived?
I have millions of files and remotely rsyncing across the internet to AWS seems very slow.
I know that rsync has a compression option (z), but it's my understanding that that compresses changed files on a per file basis. If there are many small files, the overhead of sending a 1KB as opposed to a 50KB file is still the bottleneck.
Also, simply tarring the whole directory is not efficient either as it will take an hour to archive
You can use the rsyncable option of gzip or pigz to compress the tar file to .gz format. (You will likely have to find a patch for gzip to add that. It's already part of pigz.)
The option partitions the resulting gzip file in a way that permits rsync to find only the modified portions for much more efficient transfers when only some of the files in the .tar.gz file have been changed.
I was looking for exact same thing as you and I landed on using borg.
tar cf - -C $DIR . | borg create $REPO::$NAME
tar will still read entire folder so you won't avoid a read penalty versus just rsyncing two dirs (since I believe rsync uses tricks to avoid reading each file for changes), but you will avoid the write penalty because borg will only write blocks it hasn't encountered before. Also borg auto compresses so no need for xz/gzip. Also, if borg is installed on both ends it won't send over superfluous data either since the two borgs can let each other know what they have versus don't.
If avoiding that read penalty is crucial for you, you could possibly use rsync to use its tricks to just tell you which files changed, create a difftar and send that to borg, but then getting borg to merge archives is whole second headache. You'd likely end up creating a filter that removes paths that were deleted from the original archive and then creating a new archive of just file additions/changes. And then you'd have to do that for each archive recursively. In the end it will create the original archive by extracting each version in sequence, but like I said a total headache.

Reading Input Data from GCS

What is the suggest way of loading data from GCS? The sample code shows copying the data from GCS to the /tmp/ directory. If this is the suggest approach, how much data may be copied to /tmp/?
While you have that option, you shouldn't need to copy the data over to local disk. You should be able to reference training and evaluation data directly from GCS, by referencing your files/objects using their GCS URI -- eg. gs://bucket/path/to/file. You can use these paths where you'd normally use local file system paths in TensorFlow APIs that accept file paths. TensorFlow supports the ability to access data (and write to) GCS.
You should also be able to use a prefix to reference a set of matching files, rather than referencing each file individually.
Followup note -- you'll want to check out https://cloud.google.com/ml/docs/how-tos/using-external-buckets in case you need to appropriately ACL your data for being accessible to training.
Hope that helps.

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day and the ten to fifteen minutes when it searches my files for modifications, slowing down everything considerably, start getting on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anyhing in the manpage and was yet unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends how your applications handle this: Do they rewrite the file (not creating a new one) or do they create a temp file and rename that when all data has been written (as they should).
In the first case, there is little you can do: If two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will eventually finish before that. Reschedule the file if it is changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc).
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.