Make folders sync to Amazon S3 faster

I have folders on another system. We mount them onto a VM and then periodically sync them to Amazon S3 using an Amazon S3 client. Each folder has lots of subfolders, and the total size varies from 1 GB up to 40-50 GB. My transfers to S3 work fine, but I have a couple of issues.
Right now, transferring a 2 GB folder takes 4-5 minutes, which is pretty slow. How can I make the transfer of files/folders to the Amazon S3 bucket faster?
I also see another issue: running du -sh on a 2 GB folder just to see its size takes 8 minutes. I'm not sure why it takes so long.
I need advice on making the S3 sync faster with the setup I have.

Related

How to automatically sync an S3 bucket to a local folder using Windows Server

I'm trying to have a replica of my S3 bucket in a local folder. It should be updated when a change occurs in the bucket.
You can use the AWS CLI s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will be making frequent calls to AWS, which will incur API charges ($0.005 per 1,000 requests).
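For example, a minimal sketch, assuming a bucket named my-bucket and a local folder C:\s3-replica (both placeholders, not from the question):

aws s3 sync s3://my-bucket C:\s3-replica

To run it on a schedule, you could register it with Task Scheduler; the task name and 15-minute interval below are purely illustrative:

schtasks /Create /TN "S3Sync" /TR "aws s3 sync s3://my-bucket C:\s3-replica" /SC MINUTE /MO 15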
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (TntDrive, CloudBerry, Mountain Duck, etc.). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive, though this might not be a 'mount' in the traditional sense.
You will need to set up a scheduled task for a continuous sync.
Example: https://blog.spikeseed.cloud/mount-s3-as-a-disk/
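A minimal sketch, assuming WinFsp is installed and an rclone remote named s3remote has already been configured with rclone config (the remote name, bucket, and drive letter are placeholders):

rclone mount s3remote:my-bucket X: --vfs-cache-mode writes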

Download files from Amazon S3 in batches using the AWS Command Line Interface (CLI)

I have a use case where I have around 40 million objects in an Amazon S3 bucket. I want to download them, process them, and re-upload them to another Amazon S3 bucket. The first issue is that I can't download all of them at once due to a shortage of hard disk space.
I want to download them in batches. For example, I want to download 1 million objects, process them, re-upload them, and delete them from local storage. After that, I will download the next 1 million objects and repeat the same process.
Is it possible to use the AWS CLI to download files from an Amazon S3 bucket in batches?
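One possible sketch of such a batch loop, assuming the keys can be grouped by enumerable prefixes; the bucket names, prefixes, and process.sh script are placeholders, not part of the original question:

#!/bin/bash
# Work through the source bucket one prefix at a time so only one batch is on disk at once.
for prefix in batch-a batch-b batch-c; do
  aws s3 sync "s3://source-bucket/$prefix" ./work/   # download one batch
  ./process.sh ./work/                               # hypothetical processing step
  aws s3 sync ./work/ "s3://dest-bucket/$prefix"     # re-upload the processed files
  rm -rf ./work/                                     # free local disk space for the next batch
done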

How to speed up download of millions of files from AWS S3

I've been trying to download these files all summer from the IRS AWS bucket, but it is excruciatingly slow. Despite having a decent internet connection, the files start downloading at about 60 kbps and get progressively slower over time. That said, there are literally millions of files, but each file is very small, approximately 10-50 KB.
The code I use to download the bucket is:
aws s3 sync s3://irs-form-990/ ./ --exclude "*" --include "2018*" --include "2019*"
Is there a better way to do this?
Here is also a link to the bucket itself.
My first attempt would be to provision an instance in us-east-1 with an io-type (provisioned IOPS) EBS volume of the required size. From what I see, there is about 14 GB of data from 2018 and 15 GB from 2019, so an instance with 40-50 GB of storage should be enough. Or, as pointed out in the comments, you can use two instances, one for the 2018 files and the second for the 2019 files. That way you can download the two sets in parallel.
Then you attach an IAM role to the instance which allows S3 access. With this in place, you run your aws s3 sync command on the instance, as sketched below. The traffic between S3 and your instance should be much faster than to your local workstation.
Once you have all the files, you zip them and download the zip file. Zipping should help a lot, as the IRS files are text-based XML. Alternatively, you could just process the files on the instance itself, without ever downloading them to your local workstation.
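A rough sketch of what might be run on the instance; the local directory and archive names are placeholders, while the bucket and include filters come from the question:

# on the EC2 instance (instance role must allow reading the bucket)
aws s3 sync s3://irs-form-990/ ./irs-data/ --exclude "*" --include "2018*" --include "2019*"
zip -r irs-data.zip ./irs-data/    # compress before pulling a single archive to the workstation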
General recommendations on speeding up transfers between S3 and instances are listed in this AWS article:
How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?
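One commonly suggested tweak from that kind of guidance is raising the AWS CLI's S3 concurrency; whether these exact settings appear in the linked article is not confirmed here, and the values are purely illustrative:

aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.max_queue_size 10000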

AWS CLI S3 Bucket to Bucket copying?

When running an aws s3 cp --recursive s3://src-bucket s3://dest-bucket command, will it download the files locally and then upload them to the destination bucket? Or (hopefully) will the entire transfer happen within AWS, without the files ever touching the machine that issued the command?
The copy happens within AWS. I verified this as follows using awscli on an Ubuntu EC2 instance:
upload 4 GB of files to bucket1: peak 140 mbps sent, real time 45s, user time 32s
sync bucket1 to bucket2: peak 60 kbps sent, real time 22s, user time 2s
Note: 'real' time is wall clock time, 'user' time is CPU time in user mode.
So, there is a significant difference in peak bandwidth used (140 mbps vs. 60 kbps) and in CPU usage (32s vs. 2s). In case #1 we are actually uploading 4 GB of files to S3, but in case #2 we are copying 4 GB of files from one S3 bucket to another without them touching our local machine. The small amount of bandwidth used in case #2 comes from the awscli displaying the progress of the sync.
I saw basically identical results when copying objects (aws s3 cp) as when syncing objects (aws s3 sync) between S3 buckets.
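A sketch of how a measurement like this could be reproduced; the bucket names are placeholders:

time aws s3 cp --recursive s3://bucket1 s3://bucket2   # copy happens server-side within S3
time aws s3 sync s3://bucket1 s3://bucket2             # sync behaves the same way bucket to bucket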

Uploading a huge number of files to S3 is very slow

I am uploading 1.8 GB of data, made up of 500,000 small XML files, to an S3 bucket.
When I upload it from my local machine, it takes a very long time: 7 hours.
When I zip it and upload the archive, it takes 5 minutes.
But I can't simply zip it, because later on I would need something in AWS to unzip it.
So is there any way to make this upload faster? The file names are all different, not a running number.
Transfer Acceleration is enabled.
Please suggest how I can optimize this.
You can always upload the zip file to an EC2 instance, unzip it there, and sync the contents to the S3 bucket.
The instance role must have permission to put objects into S3 for this to work.
I also suggest you look into configuring an S3 VPC gateway endpoint before doing this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
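A minimal sketch of the instance-side steps; the archive, directory, and bucket names are placeholders, and the instance role is assumed to allow putting objects into the bucket:

# on the EC2 instance
unzip data.zip -d ./data/                    # expand the archive on the instance
aws s3 sync ./data/ s3://my-bucket/data/     # push the small files to S3 from within AWS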