gsutil compose fails composing a file larger than 250 MB - google-cloud-platform

To move a medium-sized blob (>250 MB) from one place to another in GCS (gsutil cp my_blob my_new_blob), gsutil wants me to compose it,
so I am running gsutil compose my_blob my_blob to compose it and overcome this error, but I then get another error:
the command just retries again and again until I finally get a
503 - We encountered an internal error - Try again
error.
Why is this happening? Is there a limit on the size of a file that can be composed, and why would that limit be only 250 MB?

Tried it on my end using these docs: Cloud Storage cp options.
$ gsutil -o "GSUtil:max_upload_compression_buffer_size=8G" -m cp -J filetest filtest_new
Copying file://filetest...
/ [1/1 files][300.0 MiB/300.0 MiB] 100% Done
Operation completed over 1 objects/300.0 MiB.
Tried to simplify it as well; same output with slight changes:
gsutil -m cp filetest filtest_new
XXXXX#cloudshell:~ (XXXXX)$ gsutil -m cp filetest filtest_new2
Copying file://filetest...
/ [1/1 files][300.0 MiB/300.0 MiB] 100% Done
Operation completed over 1 objects/300.0 MiB.
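A separate thought, since the question is about compose: gsutil uses compose when it performs a parallel composite upload, so if that is what the cp is triggering (an assumption on my part, not something stated in the original post), disabling composite uploads for the copy should avoid the compose step entirely:
$ gsutil -o "GSUtil:parallel_composite_upload_threshold=0" cp my_blob my_new_blob
Setting the threshold to 0 turns composite uploads off for that invocation; my_blob and my_new_blob are the names from the question.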

Related

Cannot use -m multi-threading with ** in gsutil rm

Using gsutil, I want to delete all objects in a bucket, but I do not want to delete the bucket itself. The job takes about 2 minutes and I would like to speed it up with the -m option for multi-threading.
However, when I try gsutil -m rm -a gs://dls-qa/** I get 'CommandException: Incorrect option(s) specified'. Am I not allowed to use -m with the double asterisk **, which is required to keep the bucket?
https://cloud.google.com/storage/docs/gsutil/commands/rm
I tried the same command and it worked in my case:
gsutil -m rm -a "gs://mazlum_test/**"
I added the " characters around my bucket pattern because I am using oh-my-zsh.
Maybe you need to upgrade your gcloud SDK version.
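If the SDK version does turn out to be the issue, here is a quick way to check and update it (a generic sketch, assuming gsutil was installed as part of the Cloud SDK):
$ gsutil version -l
$ gcloud components update
gsutil version -l prints the installed gsutil version along with its configuration, and gcloud components update brings the bundled gsutil up to date.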

gsutil taking 3 hrs to transfer 5000 files from one bucket to another bucket

I am trying to transfer 7000 files from one bucket to another using the command below, but it is taking approximately 3 hours to complete. How can I bring this down to around 5 minutes?
for ELEMENT in $list; do
gsutil -m mv -r gs://${FILE_PATH}/$ELEMENT gs://${GCS_STORAGE_PATH}/
done
I think you're not actually parallelizing the gsutil mv; well, you are but only for one file (not many).
You're sending each ${ELEMENT} to gsutil -m mv in turn, which means each invocation only moves a single file (or prefix) and blocks while doing so.
Once each file is mv'd, the next file is processed.
To use gsutil -m correctly, I think you need to pass the command a wildcard that represents the set of files to e.g. mv in parallel.
If you're unable to provide gsutil with a pattern that matches the set of files (and thus have it parallelize the command), another option is to background (&) each gsutil command, though you'll want to limit the number of concurrent tasks and it becomes trickier to capture gsutil's stdout and stderr; see the sketch below.
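For illustration, a rough sketch of both approaches, reusing $list, FILE_PATH and GCS_STORAGE_PATH from the question (the prefix* wildcard is a placeholder for whatever pattern matches your objects):
# Option 1: let gsutil -m parallelize across a wildcard
gsutil -m mv "gs://${FILE_PATH}/prefix*" "gs://${GCS_STORAGE_PATH}/"
# Option 2: background each mv yourself and wait for all of them to finish
for ELEMENT in $list; do
  gsutil mv -r "gs://${FILE_PATH}/${ELEMENT}" "gs://${GCS_STORAGE_PATH}/" &
done
wait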

Gsutil creates 'tmp' files in the buckets

We are using automation scripts to upload thousands of files from MAPR HDFS to GCP storage. Sometimes files in the main bucket appear with a tmp~!# suffix, which causes failures in our pipeline.
Example:
gs://some_path/.pre-processing/file_name.gz.tmp~!#
We are using rsync -m and, in certain cases, cp -I:
some_file | gsutil -m cp -I '{GCP_DESTINATION}'
gsutil -m rsync {MAPR_SOURCE} '{GCP_DESTINATION}'
It's possible that a copy attempt failed and was retried later from a different machine; eventually we end up with both the file and another one with the tmp~!# suffix.
I'd like to get rid of these files without actively looking for them.
We have gsutil 4.33. Appreciate any leads, thanks.
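Not an answer to the root cause, but one way to sweep these leftovers up without hunting for them by hand (a sketch, assuming the stray objects always end with the tmp~!# suffix under that prefix):
gsutil ls "gs://some_path/.pre-processing/**" | grep 'tmp~!#$' | gsutil -m rm -I
The -I flag makes gsutil rm read the list of object URLs to delete from stdin.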

Compress a directory in google cloud storage bucket and then download in local directory

I have a directory named bar in a Google Cloud Storage bucket foo. There are around 1 million small files (each around 1-2 KB) in directory bar.
According to this reference, if I have a large number of files I should use the gsutil -m option to download them, like this:
gsutil -m cp -r gs://foo/bar/ /home/username/local_dir
But given the total number of files (around 10^6), the whole download process is still slow.
Is there a way so that I can compress the whole directory in cloud storage and then download the compressed directory to the local folder?
There's no way to compress the directory in the cloud before copying, but you could speed up the copy by distributing the processing across multiple machines. For example, have scripts so
machine1 does gsutil -m cp -r gs://<bucket>/a* local_dir
machine2 does gsutil -m cp -r gs://<bucket>/b* local_dir
etc.
Depending on how your files are named you may need to adjust the above, but hopefully you get the idea.
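A single-machine variant of the same idea, in case spinning up several machines is not an option (a sketch assuming the object names under bar/ start with lowercase letters; adjust the prefix list to your naming):
for prefix in a b c d e f; do
  gsutil -m cp -r "gs://foo/bar/${prefix}*" /home/username/local_dir &
done
wait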

aws s3 cp clobbers files?

Um, not quite sure what to make of this.
I am trying to download 50 files from S3 to EC2 machine.
I ran:
for i in `seq -f "%05g" 51 101`; do (aws s3 cp ${S3_DIR}part-${i}.gz . &); done
A few minutes later, I checked on pgrep -f aws and found 50 processes running. Moreover, all files were created and started to download (large files, so expected to take a while to download).
At the end, however, I got only a subset of files:
$ ls
part-00051.gz part-00055.gz part-00058.gz part-00068.gz part-00070.gz part-00074.gz part-00078.gz part-00081.gz part-00087.gz part-00091.gz part-00097.gz part-00099.gz part-00101.gz
part-00054.gz part-00056.gz part-00066.gz part-00069.gz part-00071.gz part-00075.gz part-00080.gz part-00084.gz part-00089.gz part-00096.gz part-00098.gz part-00100.gz
Where is the rest??
I did not see any errors, but I saw these for successfully completed files (and these are the files that are shown in the ls output above):
download: s3://my/path/part-00075.gz to ./part-00075.gz
If you are copying many objects to/from S3, you might try the --recursive option to instruct aws-cli to copy multiple objects:
aws s3 cp s3://bucket-name/ . --recursive --exclude "*" --include "part-*.gz"
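If the recursive copy is still slower than expected, the CLI's built-in transfer concurrency can also be raised (an extra knob beyond the original answer; this writes to the default profile's configuration):
aws configure set default.s3.max_concurrent_requests 20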