Copying multiple files inside a Google Cloud bucket to different directories based on file name - google-cloud-platform

Suppose I have multiple files in different sub-directories with names like 20060630 AD8,11 +1015.WAV and 20050508_Natoa_Enc1_AD5AK_1.WAV. I know that all these files will contain a substring like AD (in the first file) or AD and AK (in the second). There are 16 of these classes in total (AD, AK, AN, etc.) that I've made as empty folders in the top-level directory.
I want to copy each of these files into its respective directory according to the substring matched. Using gsutil, a command would look like:
gsutil cp "gs://bucket/Field/2005/20060630 AD8,11 +1015.WAV" "gs://bucket/AD/20060630 AD8,11 +1015.WAV"
How can this approach be automated for thousands of files in the same bucket?
Is it safe to assume an approach like:
if 'AD' in filename:
    gsutil cp gs://bucket/<filename> gs://bucket/AD/<filename>
elif 'AK' in filename:
    gsutil cp gs://bucket/<filename> gs://bucket/AK/<filename>

You can write a simple Bash script for this. The code is straightforward, since gsutil supports wildcards and can recurse into sub-directories to find your files.
#!/bin/bash
bucket_name=my-example-bucket

# One entry per class folder; extend this list to all 16 classes.
substring_list=(
AD
AK
AN
)

# The ** wildcard makes gsutil search recursively; quoting the URLs keeps the
# local shell from touching the wildcards before gsutil sees them.
for substring in "${substring_list[@]}"; do
  gsutil cp "gs://$bucket_name/**/*${substring}*" "gs://$bucket_name/$substring/"
done
I also see that you have some Python experience, so you could alternatively leverage the Python Client for Google Cloud Storage along with a similar wildcard strategy.
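If you go the Python route, here is a minimal sketch using the google-cloud-storage client; the bucket name and source prefix are placeholders, and the matching is the same simple substring test from your pseudocode (you may want a stricter pattern so that, say, AD inside an unrelated part of the name does not match).
# Minimal sketch: list objects under a prefix and copy each one into the
# class folder whose substring appears in its file name.
from google.cloud import storage

BUCKET_NAME = "my-example-bucket"   # placeholder
SOURCE_PREFIX = "Field/"            # placeholder
CLASSES = ["AD", "AK", "AN"]        # extend to all 16 classes

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for blob in client.list_blobs(BUCKET_NAME, prefix=SOURCE_PREFIX):
    filename = blob.name.split("/")[-1]
    for cls in CLASSES:
        if cls in filename:
            # Server-side copy into gs://<bucket>/<cls>/<filename>
            bucket.copy_blob(blob, bucket, new_name=f"{cls}/{filename}")
            break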

Related

gsutil rsync between gzip/non-gzip local/cloud locations

Can gsutil's rsync use the gzip'd size for change detection?
Here's the situation:
Uploaded non-gzip'd static site content to a bucket using cp -Z so it's compressed at rest in the cloud.
Modify HTML files locally.
Need to rsync only the locally modified files.
So the upshot is that the content is compressed in the cloud and uncompressed locally. Can rsync be used to figure out what's changed?
From what I've tried, I'm thinking no, because of the way rsync does its change detection:
If -c is used, compare checksums but ONLY IF file sizes are the same.
Otherwise use times.
And it doesn't look like -J/-j affects the file size comparison (the local uncompressed size is compared against the compressed cloud version, which of course never matches), so -c won't kick in. Then the times won't match either, and thus everything is uploaded again.
This seems like a fairly common use case. Is there a way of solving this?
Thank you,
Hans
To figure out how rsync identifies what has changed when used through gsutil, please check the Change Detection Algorithm section of the documentation.
I am unsure how you want to compare the gzip'd and non-gzip'd versions, but maybe gsutil compose could be used as an intermediate step to compare the files before they are compressed.
Take into account gsutil rsync's 4th limitation:
The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because Cloud Storage objects are immutable and no facility exists to read partial object checksums or perform partial replacements.
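Since the limitation above rules out size- and mtime-based detection for this layout, one possible workaround (not gsutil rsync itself, and at the cost of downloading the objects to compare them) is to hash the decompressed cloud content against the local files and re-upload only the ones that differ with cp -Z. This is only a sketch; the bucket name, local directory, and file pattern are placeholders, and it assumes the Python client's default download performs decompressive transcoding for objects stored with Content-Encoding: gzip.
# Sketch: list local HTML files, compare their content hashes against the
# decompressed cloud copies, and print only the changed ones for re-upload.
import hashlib
from pathlib import Path
from google.cloud import storage

BUCKET_NAME = "my-static-site-bucket"   # placeholder
LOCAL_ROOT = Path("./site")             # placeholder

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

changed = []
for local_path in LOCAL_ROOT.rglob("*.html"):
    blob = bucket.blob(local_path.relative_to(LOCAL_ROOT).as_posix())
    if not blob.exists():
        changed.append(local_path)
        continue
    # For objects uploaded with `cp -Z`, a default (non-raw) download should
    # hand back the original uncompressed bytes.
    remote_md5 = hashlib.md5(blob.download_as_bytes()).hexdigest()
    local_md5 = hashlib.md5(local_path.read_bytes()).hexdigest()
    if remote_md5 != local_md5:
        changed.append(local_path)

for path in changed:
    print(path)   # feed this list to `gsutil cp -Z`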

Batch File to compare 1 local and 1 network folder and copy local to another local if they match

I have a network folder that contains sub-folders and files on a network drive. I want to automate copying the files and folders to my 4 local computers. Due to bandwidth issues, I have a scheduled task that pulls the update over at night to a single computer. I would like a batch file for the other 3 local computers that can verify when the 2 folders on separate devices (1 local and 1 remote) are in sync and then copy the local files to itself.
I have looked through Robocopy and several of the other compare commands, and I see they give me a report of the differences, but what I am looking for is something conditional that lets the batch file continue. I would execute it from a scheduled task; basically it would perform like:
IF "\\remotepc\folder" EQU "\\localpc1\folder" robocopy "\\localpc1\folder" "c:\tasks\updater" /MIR
ELSE GOTO :EOF
Thanks in advance. Any help is appreciated.
A Beyond Compare script can compare two folders, then copy matching files to a third folder.
Script:
load c:\folder1 \\server1\share\folder1
expand all
select left.exact.files
copyto left path:base c:\output
To run the script, use the command line:
"c:\program files\beyond compare 4\bcompare.exe" "#c:\script.txt"
The # character makes Beyond Compare run a file as a script instead of loading it for GUI comparison.
Beyond Compare scripting resources:
Beyond Compare Help - Scripting
Beyond Compare Help - Scripting Reference
Beyond Compare Forums - Scripting
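If a plain script is preferred over Beyond Compare, the same compare-then-copy idea can be sketched with the Python standard library; this is only an illustration, and the three paths below are placeholders taken from the question. Note that filecmp.dircmp uses a shallow (size/timestamp) comparison by default, and shutil.copytree copies files into the target but, unlike robocopy /MIR, does not delete extras.
# Sketch: copy the local folder to the target only if it matches the remote one.
import filecmp
import shutil
import sys
from pathlib import Path

REMOTE = Path(r"\\remotepc\folder")   # placeholder UNC path
LOCAL = Path(r"\\localpc1\folder")    # placeholder UNC path
TARGET = Path(r"c:\tasks\updater")    # placeholder

def folders_match(dcmp: filecmp.dircmp) -> bool:
    """True when both trees hold the same file names with no differing files."""
    if dcmp.left_only or dcmp.right_only or dcmp.diff_files or dcmp.funny_files:
        return False
    return all(folders_match(sub) for sub in dcmp.subdirs.values())

if folders_match(filecmp.dircmp(REMOTE, LOCAL)):
    shutil.copytree(LOCAL, TARGET, dirs_exist_ok=True)   # mirror-style copy
else:
    sys.exit(1)   # folders differ; skip the copy this run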

Caching for faster recompression of folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are several questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are a couple of relevant options:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.
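To illustrate the incremental-add part in code: opening an existing archive in append mode writes only the new entry and leaves the already-compressed entries untouched. The archive and file names are placeholders. (Python's zipfile module has no delete operation, so for removals the zip -d command quoted above is the simpler route.)
# Sketch: append one new file to an existing archive without recompressing
# the entries that are already in it.
import zipfile

with zipfile.ZipFile("stuff.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("new_report.txt")   # only this entry gets compressed and appended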

PDI - Collecting File From FTP Older Than N Day

I have a job that collects data from FTP using the 'Get a file with FTP' entry, and I want it to collect only yesterday's files, files older than N days, or files based on a specific date.
How can I do that? Is there any way to make this possible?
As far as I know, 'Get a file with FTP' only copies files directly from the FTP server to the destination folder, so I can't use any field and assign it to a JavaScript variable to create a condition.
My requirement is to move only yesterday's (or ...) files from FTP into the location I need, not all of them, because I have a lot of files (about 30K-40K, with various sizes) and it would take a lot of time otherwise.
Below is a picture of what I have designed.
There is a Scripting/Shell job entry in which you can put any shell script, for example:
find . -mindepth 1 -maxdepth 1 -mtime -7 -exec mv -t /destination/path {} +
For an explanation of the shell script, have a look here: https://unix.stackexchange.com/questions/207679/moving-files-modified-after-a-specific-date
By using the 'Get File Names' step in a transformation, you can access your FTP files (via VFS) and their attributes, namely the 'lastmodifiedtime'.
With this information you can do a simple filter by date and only download the files that are older than N days, or apply any other filter you require. With that in hand you can move, download, or perform any other file-related action you need.
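If a small script outside PDI is acceptable (for example, run from the Scripting/Shell job entry mentioned above), the same filter-by-age idea can be sketched with Python's ftplib. Host, credentials, directories, and the age threshold below are placeholders, and it assumes the FTP server supports the MLSD command so modification times are available. As an aside on the find command above, -mtime -7 selects files modified within the last 7 days, while -mtime +N selects files older than N days, which is closer to the "yesterday or older" requirement.
# Sketch: list files over FTP, keep only those modified at least N days ago,
# and download just that subset. All connection details are placeholders.
from datetime import datetime, timedelta, timezone
from ftplib import FTP
from pathlib import Path

HOST, USER, PASSWORD = "ftp.example.com", "user", "secret"   # placeholders
REMOTE_DIR, LOCAL_DIR = "/incoming", Path("./downloads")     # placeholders
MAX_AGE = timedelta(days=1)                                  # "yesterday or older"

cutoff = datetime.now(timezone.utc) - MAX_AGE
LOCAL_DIR.mkdir(parents=True, exist_ok=True)

with FTP(HOST) as ftp:
    ftp.login(USER, PASSWORD)
    ftp.cwd(REMOTE_DIR)
    for name, facts in ftp.mlsd():
        if facts.get("type") != "file":
            continue
        # The MLSD 'modify' fact is a UTC timestamp such as 20240131093000
        stamp = facts["modify"].split(".")[0]
        modified = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        if modified <= cutoff:
            with open(LOCAL_DIR / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)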

What are the atomic guarantees on copying a folder in GCS?

I have a GCS bucket containing a directory my-bucket-name/my-temp-dir-name. This directory contains many subfiles. I would like to execute a copy command, e.g. gsutil cp gs://my-bucket-name/my-temp-dir-name gs://my-bucket-name/my-dir-name.
Are there any atomic guarantees around this operation? Is it possible that some files will be accessible in my-dir-name before all the files are available? What if my-dir-name already exists?
Individual object copies are atomic, but GCS does not support atomicity of copies across multiple objects.
your-dir-name must exist before copying, otherwise the cp operation will result in a 404 when trying to locate the bucket.
Objects are copied independently (one at a time, or in parallel, depending on whether you run gsutil with the -m flag). Therefore, files will start to appear in your-dir-name as soon as each individual copy completes.
Note that objects in GCS are immutable, and operations are atomic at the object level. This means that the latest written object wins and replaces the previous one(s). If you are interested in keeping previous versions, you can enable Object Versioning, and older versions will be retained (you can cap how many with lifecycle rules).
Bonus: If you are copying multiple files at once, use the -m flag to copy more than one object at the same time, like so:
gsutil -m cp -r gs://my-bucket-name/my-temp-dir-name gs://my-bucket-name/my-dir-name
or
gsutil -m cp gs://my-bucket-name/my-temp-dir-name/* gs://my-bucket-name/my-dir-name
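To make the per-object behaviour concrete, here is a sketch with the Python client (bucket and prefix names are taken from the question): each copy_blob call is its own atomic operation, and nothing spans the loop, so a reader can see some copied objects in my-dir-name while others are still missing. The if_generation_match=0 precondition is an assumption about the client's keyword arguments, used here so a copy fails instead of silently overwriting an object that already exists at the destination.
# Sketch: copy every object under one prefix to another, one atomic copy at a
# time; there is no transaction covering the whole set.
from google.cloud import storage
from google.api_core.exceptions import PreconditionFailed

client = storage.Client()
bucket = client.bucket("my-bucket-name")

for blob in client.list_blobs("my-bucket-name", prefix="my-temp-dir-name/"):
    new_name = blob.name.replace("my-temp-dir-name/", "my-dir-name/", 1)
    try:
        # if_generation_match=0 (assumed keyword) refuses to overwrite an
        # existing destination object instead of replacing it.
        bucket.copy_blob(blob, bucket, new_name=new_name, if_generation_match=0)
    except PreconditionFailed:
        print(f"skipped {new_name}: destination already exists")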