syncing 500GB of data from one GCP bucket to another

syncing 500GB of data from one GCP bucket to another - google-cloud-platform

I am trying to sync 500GB of data from one bucket to another across project. Using below command I am able to initiate the sync and it is copying files. But it is taking way long(in 2 days copied 50GB). Is there a faster way around it.
Note: I have nested folders in source bucket and file count is around 74 Million.
gsutil -m rsync -r gs://source_bucket gs://destination_bucket

Try the following tool
Documentation

It's more on the step 4 that the interesting happens

New solution came out
gcloud storage

Related

Faster method to download 6.5m objects from GCP bucket

I'm looking for a faster method to download a ton of objects (6.5 million in my case) from a bucket. The average object size is 2kb (it's a JSON file). The method I used was gsutil -m cp -r gs://<bucket>/<folder> . which took 14 hours for 1M objects.
It's not feasible to run this on my laptop for 7 days straight. Any ideas?
PS: I don't need them to be in individual JSON files. I'm thinking to create a script that pulls a file from the bucket, and adds a row to a CSV, then deletes the file.

Try downloading the files to a VM, compressing the files into a single tgz (or bz2 or xz), upload back to the bucket, and download the tgz.
Cloud shell should work too.

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (betwen 500 bytes to 1.2k). This bucket currently contains over 1.7 Million files and will be ever increasing.
The way that I add data to this bucket is by generating batches of files (in the order 50.000 files) and transfering those files into the bucket.
Now the problem is this. If I transfer the files one by one in a loop it takes an unbareably long amount of time. So if all the files a in a directory origin_directory I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would do this command 50000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours according to my calculations. However I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this, works much much faster. (Will take about 5 hours, according to my calculations). However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the files in the bucket will be ever increasing, I'm thinking that this will take longer and longer as times moves on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?

You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (eg all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (eg copy more files simultaneously).

try using https://github.com/mondain/jets3t
it does this same function but works in parallel, so it will complete the job much faster.

Copying objects from one bucket directory folder to another bucket folder using transfer

I'm wanting to use google transfer to copy all folders/files in a specific directory in Bucket-1 to the root directory of Bucket-2.
Have tried to use transfer with the filter option but doesn't copy anything across.
Any pointers on getting this to work within transfer or step by step for functions would be really appreciated.

I reproduced your issue and worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried to copy using Transfer option and it also worked. The steps I have done with Transfer option are these:
1 - Create new Transfer Job
Panel: “Select Source”:
2 - Select your source for example Google Cloud Storage bucket
3 - Select your bucket with the data which you want to copy.
4 - On the field “Transfer files with these prefixes” add your data (I used “example.txt”)
Panel “Select destination”:
5 - Select your destination Bucket
Panel “Configure transfer”:
6 - Run now if you want to complete the transfer now.
7 - Press “Create”.
For more information about copy from a bucket to another you can check the official documentation.

So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
The previous is also the reason why you cannot transfer a file that is inside a “directory” and expect to see only the file’s name appear in the root of your targeted bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
This is why, If you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
Also, on your future posts, try to provide as much relevant information as possible. For this post, as an example, it would’ve been nice to know what file structure you have on your buckets and what you have been getting as an output. And If you can provide straight away what’s your use case, it will also prevent other users from suggesting solutions that don’t apply to your needs.

try this in Cloud Shell in the project
gsutil cp -r gs://bucket1/foldername gs://bucket2

More efficient use of aws s3 sync?

Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates something around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload is costing us a ~$2 (we think) and this adds up to a monthly bill that raises the eyebrow.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon s3 already has the etag field which is can be an MD5 hash of the file. But the aws s3 sync command doesn't use etag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)

The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid copying all files if they are updated with the same content.

As an alternative to s3 sync or cp you could use s5cmd
https://github.com/peak/s5cmd
This is able to sync files on the size and date if different, and also has speeds of up to 4.6gb/s
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu

S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?

The issue that I got was using wildcard * in the --include option. Using one wildcard was fine but when I added the second * such as /log., it looked like sync tried to download everything to compare, which took a lot of CPU and network bandwidth.

aws s3 mv/sync command

I have about 2 million files nested in subfoldrs in a bucket and want to move all of them to another bucket. Spending much of time on searching ... i found a solution to use AWS CLI mv/sync command. use move command or use sync command and then delete all the files after successfully synced.
aws s3 mv s3://mybucket/ s3://mybucket2/ --recursive
or it can be as
aws s3 sync s3://mybucket/ s3://mybucket2/
But the problem is how would i know that how many files/folders have moved or synced and how much time would it take...
And what if some exception occurs(machine/server stops/ internet disconnection due to any reason )...i have to again execute the command or it will for surely complete and move/sync all files. How can i be sure about the number of files moved/synced and files not moved/synced.
or can i have something like that
I move limited number of files e.g 100 thousand.. and repeat until all files are moved...
or move files on the basis of uploaded time.. e.g files uploaded from starting date to ending date
if yes .. how?

To sync them use:
aws s3 sync s3://mybucket/ s3://mybucket2/
You can repeat the command, after it finish (or fail) without issue. This will check if anything is missing/different to the target s3 bucket and will process it again.
The time depends on what size are the files, how much objects you have. Amazon counts directories as an object, so they matter too.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js