When running an aws s3 cp --recursive s3://src-bucket s3://dest-bucket command, will it download the files locally and then upload them to the destination bucket? Or (hopefully) will the entire transfer happen within AWS, without the files ever touching the machine that issued the command?
Thanks
The copy happens within AWS. I verified this as follows using awscli on an Ubuntu EC2 instance:
1. upload 4 GB of files to bucket1: peak 140 Mbps sent, real time 45s, user time 32s
2. sync bucket1 to bucket2: peak 60 kbps sent, real time 22s, user time 2s
Note: 'real' time is wall clock time, 'user' time is CPU time in user mode.
So, there is a significant difference in peak bandwidth used (140 Mbps vs. 60 kbps) and in CPU usage (32s vs. 2s). In case #1 we are actually uploading 4 GB of files to S3, but in case #2 we are copying 4 GB of files from one S3 bucket to another without them touching our local machine. The small amount of bandwidth used in case #2 comes from the awscli reporting progress of the sync.
I saw essentially identical results whether copying objects (aws s3 cp) or syncing objects (aws s3 sync) between S3 buckets.
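For reference, a minimal sketch of how such a measurement can be reproduced (bucket names and the test directory are placeholders; peak bandwidth needs a separate network monitor, while the shell's time builtin reports the real/user figures):

# Case #1: upload ~4 GB of local files to the first bucket
time aws s3 cp ./testdata s3://bucket1/testdata --recursive

# Case #2: copy the same objects bucket-to-bucket; only API calls and
# progress output leave the instance, so the sent traffic stays tiny
time aws s3 sync s3://bucket1/testdata s3://bucket2/testdata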
I've been trying to download these files from the IRS AWS bucket all summer, but it is excruciatingly slow. Despite having a decent internet connection, the files start downloading at about 60 kbps and get progressively slower over time. That being said, there are literally millions of files, but each file is very small, approximately 10-50 KB.
The code I use to download the bucket is:
aws s3 sync s3://irs-form-990/ ./ --exclude "*" --include "2018*" --include "2019*"
Is there a better way to do this?
Here is also a link to the bucket itself.
My first attempt would be to provision an EC2 instance in us-east-1 with an io-type (provisioned IOPS) EBS volume of the required size. From what I see, there is about 14 GB of data from 2018 and 15 GB from 2019, so a volume of 40-50 GB should be enough. Or, as pointed out in the comments, you can have two instances, one for the 2018 files and the second for the 2019 files. This way you can download the two sets in parallel.
Then you attach an IAM role to the instance that allows S3 access. With this in place, you execute your aws s3 sync command on the instance. The traffic between S3 and your instance should be much faster than to your local workstation.
Once you have all the files, you zip them and then download the zip file. Zipping should help a lot, as the IRS files are text-based XML. Alternatively, maybe you could just process the files on the instance itself, without needing to download them to your local workstation.
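A minimal sketch of that on-instance workflow (the local paths are illustrative, and the instance role is assumed to allow reading the bucket):

# Pull only the 2018 and 2019 filings onto the instance's EBS volume
aws s3 sync s3://irs-form-990/ /data/irs-990/ --exclude "*" --include "2018*" --include "2019*"

# Compress the text-based XML before pulling it down to the workstation
zip -r /data/irs-990.zip /data/irs-990

The zip file can then be fetched with scp, or copied to a staging bucket and downloaded once.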
General recommendations on speeding up transfers between S3 and EC2 instances are listed in this AWS Knowledge Center article:
How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?
I am uploading a single large file to S3 (> 1 GB) and cannot get throughput over 250 MB/s regardless of instance class.
I've checked all the obvious things: I have a VPC endpoint, I am using multipart upload (via AWS CLI aws s3 cp and aws s3 sync), and the instance has ENA enabled.
What's more, I can upload multiple files in parallel and achieve an aggregate bandwidth > 500 MB/s, effectively saturating my EBS volume's throughput, but I cannot get the same bandwidth on a single file.
I do not see any evidence of throttling in the --debug output.
Support hasn't found an answer for me yet either, so I'm wondering if there's something obvious I'm missing, or a limit on how quickly you can upload to a single object.
I have folders on another system. We mount it onto a VM and then periodically sync it to Amazon S3 using the Amazon S3 client. Each folder has lots of subfolders, and the total size varies from 1 GB up to 40-50 GB. My transfer to S3 works fine, but I have a couple of issues.
Right now, transferring a 2 GB folder takes 4-5 minutes, which is pretty slow. How can I make the transfer of files/folders to the Amazon S3 bucket faster?
I also see another issue: if I check the size of a 2 GB folder with du -sh, it takes 8 minutes. Not sure why it takes so much time.
I need advice on making S3 sync work well with the setup I have.
I am uploading 1.8 GB of data consisting of 500,000 small XML files into an S3 bucket.
When I upload it from my local machine, it takes a very long time (7 hours).
When I zip it and upload the archive, it takes 5 minutes.
But my issue is that I cannot simply zip it, because then I need something in AWS to unzip it later.
So is there any way to make this upload faster? The file names are all different, not a running number.
Transfer Acceleration is enabled.
Please suggest how I can optimize this.
You can always upload the zip file to an EC2 instance, unzip it there, and sync the contents to the S3 bucket.
The instance role must have permission to put objects into S3 for this to work.
I also suggest you look into configuring an S3 VPC Gateway Endpoint before doing this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
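A minimal sketch of those on-instance steps (the file name, paths, and bucket name are placeholders; the instance role is assumed to allow s3:PutObject on the target bucket):

# After copying the archive to the instance (e.g. with scp), unzip it there
unzip ~/xml-files.zip -d ~/xml-files

# Then push the XML files to the bucket from inside AWS
aws s3 sync ~/xml-files s3://my-target-bucket/xml-files/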
I need to copy some buckets from one account to another. I have all the permissions, so I started transferring the data via the CLI (cp command). I am operating on a c4.large. The problem is that there is quite a lot of data (9 TB) and it goes really slowly. In 20 minutes I transferred about 20 GB...
I checked the internet speed: the download is 3000 Mbit/s and the upload is 500 Mbit/s. How can I speed it up?
The AWS Command-Line Interface (CLI) aws s3 cp command simply sends copy requests to Amazon S3. The data is transferred between the Amazon S3 buckets without being downloaded to your computer. Therefore, the size and bandwidth of the computer issuing the command are not related to the speed of the data transfer.
It is likely that the aws s3 cp command is only copying a small number of files simultaneously. You could increase the speed by setting the max_concurrent_requests parameter to a higher value:
aws configure set default.s3.max_concurrent_requests 20
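If that alone is not enough, the same s3 section of the CLI configuration exposes a few related knobs; the values below are a sketch, not tuned recommendations:

# Allow more transfer tasks to be queued (the default is 1000)
aws configure set default.s3.max_queue_size 10000

# Use larger parts for multipart transfers of big objects (the default is 8MB)
aws configure set default.s3.multipart_chunksize 64MB

After changing the configuration, simply re-run the same aws s3 cp command.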
See:
AWS CLI S3 Configuration — AWS CLI Command Reference
Getting the Most Out of the Amazon S3 CLI | AWS Partner Network (APN) Blog