Exclude small files when using gsutil rsync - google-cloud-platform

I would like to upload files of a given folder to a bucket using gsutil rsync. However, instead of uploading all files, I would like to exclude files that are below a certain size. The Unix rsync command offers the option --min-size=SIZE. Is there an equivalent for the gsutil tool? If not, is there an easy way of excluding small files?

OK, so the easiest solution I found is to move the small files into a subdirectory and then use gsutil rsync without the -r option, so the subdirectory is not uploaded. The code for moving the files:
import shutil
from glob import glob
from os import makedirs
from os.path import getsize, join

def filter_images(source, limit):
    # Collect all TIFF images smaller than limit (in megabytes)
    imgs = [img for img in glob(join(source, "*.tiff"))
            if getsize(img) / (1024 * 1024.0) < limit]
    if len(imgs) == 0:
        return
    # Move the small images into a "filtered" subdirectory
    filtered_dir = join(source, "filtered")
    makedirs(filtered_dir)
    for img in imgs:
        shutil.move(img, filtered_dir)
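After moving the small images aside, the upload itself is just a non-recursive sync, which skips the filtered subdirectory (the bucket path here is only a placeholder):
gsutil rsync ./source gs://my-bucket/path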

gsutil doesn't have this option. You could script it manually and send the files one by one, but that's not very efficient. I propose this command instead:
find . -type f -size -4000c | xargs -I{} gsutil cp {} gs://my-bucket/path
Here only files below 4000 bytes (about 4k) will be copied. These are the size units that find accepts:
c for bytes
w for two-byte words
k for Kilobytes
M for Megabytes
G for Gigabytes
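The question actually asks for a minimum size; assuming GNU find, the same idea with -size + and gsutil cp -I (which reads the list of files to copy from stdin) should work. For instance, to upload only files larger than 1 MB:
find . -type f -size +1M | gsutil -m cp -I gs://my-bucket/path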

Related

Reading first few lines from files in google cloud storage

While processing huge files (~100 GB), we sometimes need to check the first/last few lines (header and trailer lines).
The easy option is to download the entire file locally using
gsutil cp gs://bucket_name/file_name .
and then use head/tail to check the header/trailer lines. This is not feasible, as it is time-consuming and incurs the cost of pulling the data out of the cloud.
It is effectively the same as running
gsutil cat gs://bucket_name/file_name | head -1
The other options are to create an external table in GCP, visualize the data in Data Studio, or read it from a Dataproc cluster/VM.
Is there any other quick option just to check the header/trailer lines of a file in Cloud Storage?
gsutil cat -r
is the key here.
It outputs just the specified byte range of the object. Offsets start at 0.
E.g.
To return bytes 10 through 100 of the file:
gsutil cat -r 10-100 gs://bucket_name/file_name
To return bytes from the 100th to the end of the file:
gsutil cat -r 100- gs://bucket_name/file_name
To return the last 10 bytes of the file:
gsutil cat -r -10 gs://bucket_name/file_name
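To fetch just the header or trailer line without downloading the whole object, the range read can be combined with head/tail. Assuming those lines fit within the first/last 4 KB, something like this should do:
gsutil cat -r 0-4095 gs://bucket_name/file_name | head -1
gsutil cat -r -4096 gs://bucket_name/file_name | tail -1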

convert all BMP files recursively to JPG handling paths with spaces and getting the file extension right under Linux

I have files with beautiful, glob-friendly pathnames such as:
/New XXXX_Condition-selected FINAL/677193 2018-06-08 Xxxx Event-Exchange_FINAL/Xxxxx Dome Yyyy Side/Xxxx_General016 #07-08.BMP
(the Xxx...Yyyy strings are for privacy reasons). Of course the format is not fixed: the depth of the folder hierarchy can vary, but spaces, letters and symbols such as _, - and # can all appear, either as part of the path or part of the filename, or both.
My goal is to recurse all subfolders, find the .BMP files and convert them to JPG files, without having "double" extensions such as .BMP.JPG: in other words, the above filename must become
/New XXXX_Condition-selected FINAL/677193 2018-06-08 Xxxx Event-Exchange_FINAL/Xxxxx Dome Yyyy Side/Xxxx_General016 #07-08.JPG
I can use either bash shell tools or Python. Can you help me?
PS I have no need for the original files, so they can be overwritten. Of course a solution which doesn't overwrite them is also fine - I'll just follow up with a find . -name "*.BMP" -type f -delete command.
Would you please try:
find . -type f -iname "*.BMP" -exec mogrify -format JPG '{}' +
The command mogrify is a tool of ImageMagick suite and mogrify -format JPG file.BMP is equivalent to convert file.BMP file.JPG.
You can add the same options which are accepted by convert such as -quality.
The benefit of mogrify is it can perform the same conversion on multiple files all at once without specifying the output (converted) filenames.
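For example, to also control the JPEG quality (the value here is only illustrative):
find . -type f -iname "*.BMP" -exec mogrify -format JPG -quality 92 '{}' +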
If the command issues a warning: mogrify-im6.q16: length and filesize do not match, it means the image size stored in the BMP header disagrees with the actual size of the image data block.
If JPG files are properly produced, you may ignore the warnings. Otherwise you will need to repair the BMP files which cause the warnings.
If the input files and the output files have the same extension (for instance, a JPG-to-JPG conversion with a resize), the original files are overwritten. If they have different extensions, as in this case, the original BMP files are not removed. You can remove them with find as well.
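If you prefer an explicit loop that deletes each original only after a successful conversion, a sketch along these lines should work (convert instead of mogrify, with null-delimited paths so the spaces and # characters survive):
find . -type f -iname '*.BMP' -print0 |
while IFS= read -r -d '' f; do
    convert "$f" "${f%.*}.JPG" && rm -- "$f"
done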

zgrep in hadoop streaming

I am trying to grep a gzipped file on S3/AWS and write the output to a new location with the same file name.
I am using the command below on S3. Is this the right way to pipe the streaming output of the first cat command into the output destination?
hadoop fs -cat s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz | zgrep -a -E '*word_1*|*word_2*|word_3|word_4' | hadoop fs -put - s3://prod/project/test/test_20170303-000000.tar.gz
Given you are playing with Hadoop, why not run the code in the cluster? Scanning for strings inside a .gz file is commonplace, though I don't know about .tar files.
I'd personally use the -copyToLocal and -copyFromLocal commands to copy the file to the local FS and work there. The trouble with things like -cat is that a lot gets logged by the Hadoop client code, so a pipe is likely to pick up too much extraneous cruft.
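A rough sketch of that local-copy route, reusing the paths from the question (the /tmp locations and the matches.txt name are just placeholders, and note the result is plain-text matches rather than a tarball of the same name):
hadoop fs -copyToLocal s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz /tmp/
zgrep -a -E 'word_1|word_2|word_3|word_4' /tmp/test_20170303-000000.tar.gz > /tmp/matches.txt
hadoop fs -copyFromLocal /tmp/matches.txt s3://prod/project/test/matches.txt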

Is it possible to exclude from aws S3 sync files older then x time?

I'm trying to use the aws s3 CLI command to sync files from the server to an S3 bucket (then delete the local copy), but I can't find a way to exclude newly created files which are still in use on the local machine.
Any ideas?
This should work:
find /path/to/local/SyncFolder -mtime +1 -print0 | sed -z 's/^/--include=/' | xargs -0 /usr/bin/aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --exclude '*'
There's a trick here: we're not excluding the files we don't want, we're excluding everything and then including the files we want. Why? Because either way, we're probably going to have too many parameters to fit into the command line. We can use xargs to split up long lines into multiple calls, but we can't let xargs split up our excludes list, so we have to let it split up our includes list instead.
So, starting from the beginning, we have a find command. -mtime +1 finds all the files that are older than a day, and -print0 tells find to delimit each result with a null byte instead of a newline, in case some of your files have newlines in their names.
Next, sed adds --include= to the start of each filename, and the -z option tells sed to use null bytes instead of newlines as delimiters.
Finally, xargs will feed all those include options to the end of our aws command, calling aws multiple times if need be. The -0 option is just like sed's -z option, telling it to use null bytes instead of newlines.
To my knowledge you can only include/exclude based on filename, so the only way I see is a really dirty hack.
You could run a bash script to rename all files below your threshold, prefixing/postfixing them like TOO_NEW_%Filename%, and then run the CLI with:
--exclude 'TOO_NEW_*'
But no, don't do that.
Most likely ignoring the newer files is the default behavior. We can read in aws s3 sync help:
The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
If you'd like to change the default behaviour, you have the following parameters to use:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
To see what files are going to be updated, run the sync with --dryrun.
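For example, reusing the placeholder paths from the earlier answer:
aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --dryrun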
Alternatively, use find to list all the files which need to be excluded, and pass them to the --exclude parameter.

single repeating command with input and output files

I have been trying to learn how to adequately perform a single command multiple times using the command line. Although I have learned how to do a single command with no input and output files, it gets more complicated when it needs these.
The cp command requires these, so let's use it as an example. I look for all images with the .png extension and copy them. The way I came up with after using Google is:
find -regex ".*\.\(png\)" -exec cp {} {}3 \;
The only problem with that is that I have to append some character after the name, so the copy ends up as something like file.png3 instead of file.png. I can't figure out how to do it differently, as putting the extra character before the name doesn't seem to work.
Is there a better way to do this or am I going about it completely the wrong way?
I'm not sure how you might do that in a single find command, but you could split it out. First, find the files with find. Then use sed to remove the .png extension. Finally, use xargs to run the copy function on each file. Like this:
find -regex ".*\.\(png\)" | sed -r 's/.png//g' | xargs -I {} cp {}.png {}_copy.png
If you didn't know, the pipe "|" will send the output of one program into the next.
Alternatively, you could just modify the beginning of the filename (so 3img.png instead of img.png3) or copy to a new folder.
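It can also be done with a single find command by handing each path to a small shell snippet; the _copy suffix below is just an example, and this variant copes with paths where "png" appears somewhere other than the extension:
find . -type f -name '*.png' -exec sh -c 'cp -- "$1" "${1%.png}_copy.png"' _ {} \;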