We are using automation scripts to upload thousands of files from MAPR HDFS to GCP Cloud Storage. Sometimes files in the main bucket appear with a tmp~!# suffix, which causes failures in our pipeline.
Example:
gs://some_path/.pre-processing/file_name.gz.tmp~!#
We are using gsutil -m rsync and, in certain cases, gsutil -m cp -I:
some_file | gsutil -m cp -I '{GCP_DESTINATION}'
gsutil -m rsync {MAPR_SOURCE} '{GCP_DESTINATION}'
It's possible that a copy attempt failed and was retried later from a different machine; eventually we end up with both the file and another one with the tmp~!# suffix.
I'd like to get rid of these files without actively looking for them.
We have gsutil 4.33. Appreciate any lead. Thx
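One hedged option is to let the bucket clean these up for you. The sketch below assumes the stray objects always end in .tmp~!# and, for the first command, that they live under the prefix from the example above; the lifecycle route relies on the matchesSuffix condition, which newer Cloud Storage tooling supports but which may not be usable from gsutil 4.33.
# One-off (or scheduled) sweep of the known prefix; '|| true' keeps a
# cron job from failing when nothing matches.
gsutil -m rm 'gs://some_path/.pre-processing/*.tmp~!#' || true
# Or a bucket-wide lifecycle rule that deletes them after a day
# (gs://some_path is the bucket from the example above):
cat > tmp_cleanup_lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 1, "matchesSuffix": [".tmp~!#"]}
    }
  ]
}
EOF
gsutil lifecycle set tmp_cleanup_lifecycle.json gs://some_path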
I am trying to backup all of our Google Cloud data to an external storage device.
There is a lot of data, so I am attempting to download the entire bucket at once using the following command, but it halts, saying there isn't enough storage on the device to complete the transfer.
gsutil -m cp -r \
"bucket name" \
.
What do I need to add to this command to download this information to my local D: drive? I have searched through the available docs and have not been able to find the answer.
I used the gsutil command that GCP provided for me automatically, but it seems to be trying to copy the files to a destination without enough storage to hold the needed data.
Remember that you are running the command from Cloud Shell and not in a local terminal or Windows Command Line. If you inspect Cloud Shell's file system, it more closely resembles a Unix environment, so you can specify a destination like ~/bucketfiles/ instead. Even a simple gsutil -m cp -R gs://bucket-name.appspot.com ./ will work, since Cloud Shell recognizes ./ as the current directory.
A workaround to this is to perform the command on your Windows Command Line. You would have to install Google Cloud SDK beforehand.
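For example, a sketch with a placeholder bucket name and an illustrative D:\ folder (only the drive letter comes from the question; everything else is made up), run from the Windows command prompt after installing the SDK:
mkdir D:\gcs-backup
gsutil -m cp -r "gs://bucket-name" D:\gcs-backup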
Alternatively, this can also be done in Cloud Shell, albeit with an extra step:
Download the bucket objects by running gsutil -m cp -R gs://bucket-name ~/, which will download them into the home directory in Cloud Shell.
Transfer the files downloaded to the ~/ (home) directory from Cloud Shell to the local machine, either through the user interface or by running gcloud alpha cloud-shell scp.
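If you go the scp route, note that gcloud alpha cloud-shell scp is run from your local machine (which therefore also needs the Cloud SDK installed). A rough sketch, with ~/bucket-name standing in for whatever directory the first step created:
# In Cloud Shell: bundle the downloaded objects into a single archive.
tar czf ~/bucket-name.tar.gz -C ~ bucket-name
# On the local machine: pull the archive out of Cloud Shell.
gcloud alpha cloud-shell scp cloudshell:~/bucket-name.tar.gz localhost:~/bucket-name.tar.gz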
I am trying to transfer 7000 files from one bucket to another using the command below, but it is taking approximately 3 hours to complete. How can I reduce this to around 5 minutes?
for ELEMENT in $list; do
gsutil -m mv -r gs://${FILE_PATH}/$ELEMENT gs://${GCS_STORAGE_PATH}/
done
I think you're not actually parallelizing the gsutil mv; well, you are but only for one file (not many).
You're sending each ${ELEMENT} to gsutil -m mv in turn, which means each invocation only moves a single object and blocks while doing so.
Once each file is mv'd, the next file is processed.
To use gsutil -m effectively, you need to pass the command a wildcard that matches the set of files you want to mv in parallel.
If you can't express the set of files as a pattern (and thus have gsutil parallelize the command for you), another option is to background (&) each gsutil command, though you'll want to limit the number of concurrent tasks, and capturing stdout/stderr from gsutil becomes trickier.
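A rough sketch of both approaches; the "part-*" pattern, the object_list.txt file, and the batch size of 1000 are illustrative, so substitute whatever actually describes your 7000 objects:
# Option 1: hand gsutil a wildcard so -m can fan out across many objects at once.
gsutil -m mv "gs://${FILE_PATH}/part-*" "gs://${GCS_STORAGE_PATH}/"
# Option 2: if no single pattern fits, batch the list and background one
# gsutil per batch (assumes object names contain no whitespace).
split -l 1000 object_list.txt batch_
for batch in batch_*; do
  gsutil -m mv $(sed "s|^|gs://${FILE_PATH}/|" "$batch") "gs://${GCS_STORAGE_PATH}/" &
done
wait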
I have a problem downloading an entire folder in GCP. How should I download the whole bucket? I ran this command in the GCP Cloud Shell environment:
gsutil -m cp -R gs://my-uniquename-bucket ./C:\Users\Myname\Desktop\Bucket
and I get an error message: "CommandException: Destination URL must name a directory, bucket, or bucket subdirectory for the multiple source form of the cp command. CommandException: 7 files/objects could not be transferred."
Could someone please point out the mistake in the code line?
To download an entire bucket to your local machine, you must install the Google Cloud SDK
and then run this command:
gsutil -m cp -R gs://project-bucket-name path/to/local
where path/to/local is the destination path on your local machine.
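For instance, with the bucket name and the Windows path from the question (assuming the SDK is installed on that machine and the folder exists):
gsutil -m cp -R gs://my-uniquename-bucket "C:\Users\Myname\Desktop\Bucket"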
The error lies within the destination URL as specified by the error message.
I run this code in GCP Shell Environment
Remember that you are running the command from Cloud Shell and not in a local terminal or Windows Command Line, so it is throwing that error because it cannot find the path you specified. If you inspect Cloud Shell's file system, it more closely resembles a Unix environment, so you can specify a destination like ~/bucketfiles/ instead. Even a simple gsutil -m cp -R gs://bucket-name.appspot.com ./ will work, since Cloud Shell recognizes ./ as the current directory.
A workaround to this issue is to perform the command on your Windows Command Line. You would have to install Google Cloud SDK beforehand.
Alternatively, this can also be done in Cloud Shell, albeit with an extra step:
Download the bucket objects by running gsutil -m cp -R gs://bucket-name ~/, which will download them into the home directory in Cloud Shell.
Transfer the files downloaded to the ~/ (home) directory from Cloud Shell to the local machine, either through the user interface or by running gcloud alpha cloud-shell scp.
Your destination path is invalid:
./C:\Users\Myname\Desktop\Bucket
Change to:
/Users/Myname/Desktop/Bucket
C: is a reserved device name. You cannot specify reserved device names in a relative path. ./C: is not valid.
There is not a one-button solution for downloading a full bucket to your local machine through the Cloud Shell.
The best option for an environment like yours (only using the Cloud Shell interface, without gcloud installed on your local system), is to follow a series of steps:
Download the whole bucket into the Cloud Shell environment
Zip the contents of the bucket
Upload the zipped file
Download the file through the browser
Clean up:
Delete the local files (local in the context of the Cloud Shell)
Delete the zipped bucket file
Unzip the bucket locally
This has the advantage of only having to download a single file on your local machine.
This might seem like a lot of steps for a non-developer, but it's actually pretty simple:
First, run this on the Cloud Shell:
mkdir /tmp/bucket-contents/
gsutil -m cp -R gs://my-uniquename-bucket /tmp/bucket-contents/
pushd /tmp/bucket-contents/
zip -r /tmp/zipped-bucket.zip .
popd
gsutil cp /tmp/zipped-bucket.zip gs://my-uniquename-bucket/zipped-bucket.zip
Then, download the zipped file through this link: https://storage.cloud.google.com/my-uniquename-bucket/zipped-bucket.zip
Finally, clean up:
rm -rf /tmp/bucket-contents
rm /tmp/zipped-bucket.zip
gsutil rm gs://my-uniquename-bucket/zipped-bucket.zip
After these steps, you'll have a zipped-bucket.zip file in your local system that you can unzip with the tool of your choice.
Note that this might not work if you have too much data in your bucket and the Cloud Shell environment can't store it all, but you could repeat the same steps on individual folders instead of the whole bucket to keep the size manageable, as sketched below.
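A rough sketch of that per-folder variant; it assumes the bucket's top level is organized into folders, that names contain no spaces, and the zips/ prefix is just an example:
for prefix in $(gsutil ls gs://my-uniquename-bucket/); do
  name=$(basename "$prefix")                      # e.g. gs://my-uniquename-bucket/photos/ -> photos
  mkdir -p "/tmp/bucket-contents/$name"
  gsutil -m cp -R "$prefix" "/tmp/bucket-contents/$name/"
  zip -r "/tmp/$name.zip" "/tmp/bucket-contents/$name"
  gsutil cp "/tmp/$name.zip" "gs://my-uniquename-bucket/zips/$name.zip"
  rm -rf "/tmp/bucket-contents/$name" "/tmp/$name.zip"
done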
I have a directory named bar in google cloud storage bucket foo. There are around 1 million small files (each around 1-2 kb) in directory bar.
According to this reference, if I have a large number of files I should use the gsutil -m option to download them, like this:
gsutil -m cp -r gs://foo/bar/ /home/username/local_dir
But given the number of total files (around 10^6), the whole process of downloading the files is still slow.
Is there a way so that I can compress the whole directory in cloud storage and then download the compressed directory to the local folder?
There's no way to compress the directory in the cloud before copying, but you could speed up the copy by distributing the work across multiple machines. For example, you could have scripts so that:
machine1 does gsutil -m cp -r gs://<bucket>/a* local_dir
machine2 does gsutil -m cp -r gs://<bucket>/b* local_dir
etc.
Depending on how your files are named you may need to adjust the above, but hopefully you get the idea.
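If only one machine is available, a rough variant of the same sharding is to background one gsutil per leading character. This sketch assumes object names under gs://foo/bar/ start with a lowercase letter or digit, and that the machine's disk and network can keep up:
# Prefixes with no matches will just log a "No URLs matched" error,
# which can be ignored.
for p in {a..z} {0..9}; do
  gsutil -m cp -r "gs://foo/bar/${p}*" /home/username/local_dir &
done
wait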
Um, not quite sure what to make of this.
I am trying to download 50 files from S3 to EC2 machine.
I ran:
for i in `seq -f "%05g" 51 101`; do (aws s3 cp ${S3_DIR}part-${i}.gz . &); done
A few minutes later, I checked on pgrep -f aws and found 50 processes running. Moreover, all files were created and started to download (large files, so expected to take a while to download).
At the end, however, I got only a subset of files:
$ ls
part-00051.gz part-00055.gz part-00058.gz part-00068.gz part-00070.gz part-00074.gz part-00078.gz part-00081.gz part-00087.gz part-00091.gz part-00097.gz part-00099.gz part-00101.gz
part-00054.gz part-00056.gz part-00066.gz part-00069.gz part-00071.gz part-00075.gz part-00080.gz part-00084.gz part-00089.gz part-00096.gz part-00098.gz part-00100.gz
Where is the rest??
I did not see any errors, but I saw lines like this for successfully completed files (and these are the files shown in the ls output above):
download: s3://my/path/part-00075.gz to ./part-00075.gz
If you are copying many objects to/from S3, you might try the --recursive option to instruct aws-cli to copy multiple objects:
aws s3 cp s3://bucket-name/ . --recursive --exclude "*" --include "part-*.gz"
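To preview which objects those filters will match before copying anything, the same command accepts a --dryrun flag:
aws s3 cp s3://bucket-name/ . --recursive --exclude "*" --include "part-*.gz" --dryrun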