Does gsutil rsync bucket1 bucket2 download/upload files locally? - google-cloud-platform

I have a dumb question.
So I have terabytes of data to rsync between two GCP buckets.
I'm not too sure how gsutil rsync works behind the scenes.
Does it have to download the files locally before uploading them to the destination, or does it just magically move things over from the source bucket to the destination bucket?

The answer to your question is in the gsutil rsync documentation:
Note 2: If you are synchronizing a large amount of data between clouds you might consider setting up a Google Compute Engine account and running gsutil there. Since cross-provider gsutil data transfers flow through the machine where gsutil is running, doing this can make your transfer run significantly faster than running gsutil on your local workstation.
So yes, it downloads the content locally first, then uploads it to the destination.
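For example, a cross-cloud sync run from a Compute Engine VM could look like the sketch below (the bucket names are placeholders); the data still flows through that VM, just not through your workstation:
gsutil -m rsync -r gs://my-gcs-bucket s3://my-s3-bucket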

I performed a test with rsync and the debug flag, and I noticed this behaviour:
When you move an object (using cp or rsync) between buckets, it is not downloaded to your local machine. I used a ~4GB file and the glances monitoring tool to measure network usage during the rsync operation, and the object was moved directly to the target bucket.
If you run the following command, you will notice that the SDK performs a POST request indicating the move between buckets:
gsutil -d rsync gs://sourcebucket gs://targetbucket
https://storage.googleapis.com/storage/v1/b/sourcebucket/o/bigfile.iso/rewriteTo/b/targetbucket/o/bigfile.iso
The rewriteTo behaviour is documented in the Cloud Storage JSON API reference (Objects: rewrite).
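If you want to verify this yourself, one rough way (assuming you capture the debug output by redirecting stderr) is to filter the debug log for the rewrite calls:
gsutil -d rsync gs://sourcebucket gs://targetbucket 2>&1 | grep rewriteTo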

Related

easiest way to copy data from a VM instance in GCP?

I have a VM with a 60GB disk, of which 45GB is used. It has multiple users.
I have deleted old users and now about 40GB is used. I want to download all the data to my machine. How do I do that? The only way I know is with scp; the command is something like this one: gcloud compute scp --recurse <$USER>@<INSTANCE>:/home/$USER ~/$LOCAL_DIR, but it's very slow and might even take a couple of days. Is there a better and easier way to download the data?
The bottleneck here seems to be that you're trying to download a considerable amount of data over SSH which is quite slow.
One thing you can try to speed up the process is break down the download into two parts:
Upload the content of your VM to Cloud Storage Bucket
Download the content from Cloud Storage to your machine
So in that case, from the VM you'll execute:
gsutil cp -R /home/$USER gs://BUCKET_NAME
And then from your local machine:
gsutil cp -R gs://BUCKET_NAME ~/$LOCAL_DIR
gsutil can use parallel composite uploads to speed up uploading large files, in case you have any. And in the first step you'll be doing GCP <-> GCP communication, which will be faster than downloading over SSH directly.
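If parallel composite uploads are not enabled by default in your gsutil version, a minimal sketch of turning them on just for this copy (the 150M threshold is only an example value) is:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp -R /home/$USER gs://BUCKET_NAME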

Migrating GCP Object storage data to AWS S3 Bucket

We have terabytes of data in Google Cloud Storage and we want to migrate it to AWS S3. What are the best ways to do it? Is there any third-party tool that would be better than a direct transfer?
There are multiple options for a cloud-to-cloud migration that do not require any physical transfer device and can complete in less time.
Use gsutil to copy data from a Google Cloud Storage bucket to an Amazon S3 bucket, using a command such as:
gsutil -m rsync -r gs://your-gcp-bucket s3://your-aws-s3-bucket
More details are available at https://cloud.google.com/storage/docs/gsutil/commands/rsync
Note: if you face speed challenges with the default Cloud Shell, you can create a bigger machine and execute the above command from there.
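For gsutil to write to the s3:// destination it needs AWS credentials; a minimal sketch of the relevant section of the ~/.boto configuration file (the values are placeholders):
[Credentials]
aws_access_key_id = <YOUR_AWS_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_AWS_SECRET_ACCESS_KEY>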

How to Transfer multiple files from Linux server to AWS

Can someone let me know how to transfer multiple files from a Linux server to AWS?
If you want to copy the data to Amazon S3, the easiest method is to use the AWS Command-Line Interface (CLI), either:
aws s3 cp --recursive or
aws s3 sync
The sync command automatically recurses sub-directories and is generally the better option because it can be re-run and only copies files modified or added since the previous execution. Thus, it can be used to continue the copy after a failure, or the next day when new files have been added.
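For example, a minimal sketch with placeholder names for the local directory and the bucket:
aws s3 sync /path/to/local/files s3://my-bucket/my-prefix/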
Did you try using scp or sftp to transfer the files? If your local machine is a Linux one, you can use the terminal; otherwise use PuTTY on a Windows machine.

aws cli copy command halted

I used PuTTY to get into my AWS instance and ran a cp command to copy files into my S3 bucket.
aws s3 cp local s3://server_folder --recursive
Partway through, my internet dropped out and the copy halted even though the AWS instance was still running properly. Is there a way to make sure the cp command keeps running even if I lose my connection?
You can alternatively use the Minio Client (mc); it is open source and compatible with AWS S3. The Minio client is available for Windows as well as macOS and Linux.
The mc mirror command will help you copy local content to a remote AWS S3 bucket; if the upload fails because of a network issue, mc session resume will restart the upload from where the connection was terminated.
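As a rough sketch (the alias name s3remote and the paths are placeholders, and the alias is assumed to have been configured beforehand with mc config host add):
mc mirror /path/to/local_folder s3remote/server_folder
If the connection drops, list the saved sessions and resume the interrupted one:
mc session list
mc session resume <session-id>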
mc supports these commands.
COMMANDS:
ls List files and folders.
mb Make a bucket or folder.
cat Display contents of a file.
pipe Write contents of stdin to one target. When no target is specified, it writes to stdout.
share Generate URL for sharing.
cp Copy one or more objects to a target.
mirror Mirror folders recursively from a single source to single destination.
diff Compute differences between two folders.
rm Remove file or bucket [WARNING: Use with care].
access Set public access permissions on bucket or prefix.
session Manage saved sessions of cp and mirror operations.
config Manage configuration file.
update Check for a new software update.
version Print version.
You can check docs.minio.io for more details.
Hope it helps.
Disclaimer: I work for Minio.

downloading a file from Internet into S3 bucket

I would like to grab a file straight off the Internet and stick it into an S3 bucket, to then copy it over to a Pig cluster. Due to the size of the file and my not-so-good internet connection, downloading the file first onto my PC and then uploading it to Amazon might not be an option.
Is there any way I could go about grabbing a file off the internet and sticking it directly into S3?
Download the data via curl and pipe the contents straight to S3. The data is streamed directly to S3 and not stored locally, avoiding any memory issues.
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
As suggested above, if download speed is too slow on your local computer, launch an EC2 instance, ssh in and execute the above command there.
For anyone (like me) less experienced, here is a more detailed description of the process via EC2:
Launch an Amazon EC2 instance in the same region as the target S3 bucket. Smallest available (default Amazon Linux) instance should be fine, but be sure to give it enough storage space to save your file(s). If you need transfer speeds above ~20MB/s, consider selecting an instance with larger pipes.
Launch an SSH connection to the new EC2 instance, then download the file(s), for instance using wget. (For example, to download an entire directory via FTP, you might use wget -r ftp://name:passwd@ftp.com/somedir/.)
Using AWS CLI (see Amazon's documentation), upload the file(s) to your S3 bucket. For example, aws s3 cp myfolder s3://mybucket/myfolder --recursive (for an entire directory). (Before this command will work you need to add your S3 security credentials to a config file, as described in the Amazon documentation.)
Terminate/destroy your EC2 instance.
[2017 edit]
I gave the original answer back in 2013. Today I'd recommend using AWS Lambda to download a file and put it on S3. That achieves the desired effect: placing an object on S3 with no server involved.
[Original answer]
It is not possible to do it directly.
Why not do this with an EC2 instance instead of your local PC? Upload speed from EC2 to S3 in the same region is very good.
Regarding stream reading/writing from/to S3, I use Python's smart_open.
You can stream the file from the internet to AWS S3 using Python:
import boto3
import urllib3

s3 = boto3.resource('s3')
http = urllib3.PoolManager()
# the <...> values are placeholders; the response body is streamed straight into S3
s3.meta.client.upload_fileobj(
    http.request('GET', '<Internet_URL>', preload_content=False),
    '<bucket_name>', '<object_key>',
    ExtraArgs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': '<alias_name>'})