I've been trying to download these files all summer from the IRS AWS bucket, but it is excruciatingly slow. Despite having a decent internet connection, the files start downloading at about 60 Kbps and get progressively slower over time. There are literally millions of files, but each file is very small, roughly 10-50 KB.
The code I use to download the bucket is:
aws s3 sync s3://irs-form-990/ ./ --exclude "*" --include "2018*" --include "2019*"
Is there a better way to do this?
Here is also a link to the bucket itself.
My first attempt would be to provision an instance in us-east-1 with a Provisioned IOPS (io1/io2) EBS volume of the required size. From what I can see, there is about 14 GB of data from 2018 and 15 GB from 2019, so an instance with a 40-50 GB volume should be enough. Or, as pointed out in the comments, you can use two instances: one for the 2018 files and one for the 2019 files. This way you can download the two sets in parallel.
Then attach an IAM role to the instance that allows S3 access. With that in place, run your aws s3 sync command on the instance. The traffic between S3 and your instance should be much faster than to your local workstation.
Once you have all the files, zip them and download the zip file. Compression should help a lot, as the IRS files are text-based XML. Alternatively, you could process the files on the instance itself, without downloading them to your local workstation at all.
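Assuming the bucket and date prefixes from the question, the on-instance steps might look like the sketch below; the local directory and archive name are illustrative:

```shell
# On the EC2 instance: pull only the 2018/2019 objects from the IRS bucket
aws s3 sync s3://irs-form-990/ ./irs-990/ \
    --exclude "*" --include "2018*" --include "2019*"

# Compress before downloading to the workstation; text-based XML shrinks well
tar czf irs-990.tar.gz irs-990/
```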
General recommendations on speeding up transfers between S3 and instances are listed in this AWS article:
How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?
I need to copy some buckets from one account to another. I have all the permissions, so I started transferring the data via the CLI (cp command). I am operating on a c4.large. The problem is that there is a lot of data (9 TB) and it goes really slowly. In 20 minutes I transferred about 20 GB...
I checked the internet speed: the download is 3000 Mbit/s and the upload is 500 Mbit/s. How can I speed it up?
The AWS Command-Line Interface (CLI) aws s3 cp command simply sends the copy request to Amazon S3. The data is transferred between the Amazon S3 buckets without being downloaded to your computer. Therefore, the size and bandwidth of the computer issuing the command do not affect the speed of the data transfer.
It is likely that the aws s3 cp command is only copying a small number of files simultaneously. You could increase the speed by setting the max_concurrent_requests parameter to a higher value:
aws configure set default.s3.max_concurrent_requests 20
See:
AWS CLI S3 Configuration — AWS CLI Command Reference
Getting the Most Out of the Amazon S3 CLI | AWS Partner Network (APN) Blog
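The same configuration mechanism exposes a few related knobs documented in the AWS CLI S3 configuration reference; the values below are illustrative starting points, not tuned recommendations:

```shell
# Raise the number of parallel S3 requests (default is 10)
aws configure set default.s3.max_concurrent_requests 50
# Deepen the internal task queue feeding those requests (default is 1000)
aws configure set default.s3.max_queue_size 10000
# Switch to multipart transfers above 64 MB, in 16 MB parts
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB
```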
I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (actually, renaming the bucket would suffice, but as that is not possible, I need to create a new bucket, move the files there, and remove the old one).
I'm using AWS CLI's s3 sync command and it does the job, but takes a lot of time. I would like to reduce the time so that the dependent system downtime is minimal.
I tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.
I have noticed that the time taken can be somewhat reduced when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, e.g.:
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there an 'optimal' number of sync processes that can run in parallel on the same bucket?
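The manual batching described above can be sketched as a loop that backgrounds one sync per leading key character. The single-character prefixes below are an assumption about how the keys are named, so adjust them to your actual key space:

```shell
#!/bin/bash
# One background sync per leading character of the object keys
for prefix in 0 1 2 3 4 5 6 7 8 9; do
  aws s3 sync s3://source-bucket s3://destination-bucket \
    --exclude "*" --include "${prefix}*" &
done
wait  # block until every background sync has finished
```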
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.
You can use EMR and S3DistCp. I had to sync 153 TB between two buckets, and it took about 9 days. Also make sure the buckets are in the same region, because otherwise you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps '[{"Name":"Command Runner","Jar":"command-runner.jar","Args":["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://BUCKETNAME","--dest","s3://BUCKETNAME"]}]'
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
40,100 objects (160 GB) copied/synced in less than 90 seconds.
Follow the steps below:
Step 1: open the source bucket
Step 2: under the bucket's Properties, choose the advanced settings
Step 3: enable Transfer Acceleration and note the accelerated endpoint
One-time AWS configuration (no need to repeat this for every transfer):
aws configure set default.region us-east-1 #set it to your default region
aws configure set default.s3.max_concurrent_requests 2000
aws configure set default.s3.use_accelerate_endpoint true
Options:
--delete : delete files in the destination if they are not present in the source
AWS command to sync
aws s3 sync s3://source-test-1992/foldertobesynced/ s3://destination-test-1992/foldertobesynced/ --delete --endpoint-url https://source-test-1992.s3-accelerate.amazonaws.com
Transfer Acceleration cost:
https://aws.amazon.com/s3/pricing/#S3_Transfer_Acceleration_pricing
The pricing page does not mention the cost when both buckets are in the same region.
As a variant of what the OP is already doing:
One could create a list of all files to be synced, with aws s3 sync --dryrun
aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
# or even
aws s3 ls s3://source-bucket --recursive
Using the list of objects to be synced, split the job into multiple aws s3 cp ... commands. This way, the AWS CLI won't just hang while building the list of sync candidates, as it does when you start multiple sync jobs with --exclude "*" --include "1?/*"-style arguments.
When all the copy jobs are done, another sync might be worth running for good measure, perhaps with --delete if objects might have been deleted from the source bucket.
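A rough sketch of that approach, assuming GNU split is available; the file names and the worker count of 8 are arbitrary choices:

```shell
# 1. One serial listing pass; object keys are in the 4th column of `aws s3 ls`
#    (keys containing spaces would need a more careful listing)
aws s3 ls s3://source-bucket --recursive | awk '{print $4}' > all-keys.txt

# 2. Split the key list into 8 roughly equal chunks (xaa, xab, ...)
split -n l/8 all-keys.txt

# 3. One background worker per chunk, each issuing one `aws s3 cp` per key
for chunk in xa?; do
  ( while IFS= read -r key; do
      aws s3 cp "s3://source-bucket/${key}" "s3://destination-bucket/${key}"
    done < "$chunk" ) &
done
wait
```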
If the source and destination buckets are located in different regions, one could enable cross-region bucket replication before starting to sync.
New option in 2020:
We had to move about 500 terabytes (10 million files) of client data between S3 buckets. Since we only had a month to finish the whole project, and aws s3 sync tops out at about 120 MB/s, we knew right away this was going to be trouble.
I found this Stack Overflow thread first, but when I tried most of the options here, they just weren't fast enough. The main problem is that they all rely on serial item listing. To solve the problem, I figured out a way to parallelize listing any bucket without any a priori knowledge. Yes, it can be done!
The open source tool is called S3P.
With S3P we were able to sustain copy speeds of 8 gigabytes/second and listing speeds of 20,000 items/second using a single EC2 instance. (It's a bit faster to run S3P on EC2 in the same region as the buckets, but S3P is almost as fast running on a local machine.)
More info:
Blog post on S3P
S3P on NPM
Or just try it out:
# Run in any shell to get command-line help. No installation needed:
npx s3p
(requires Node.js, aws-cli, and valid aws-cli credentials)
Background: the bottlenecks in the sync command are listing objects and copying objects. Listing is normally a serial operation, although you can list a subset of objects by specifying a prefix; this is the only trick available for parallelizing it. Copying objects can be done in parallel.
Unfortunately, aws s3 sync doesn't parallelize the listing, and it doesn't even support listing by an arbitrary prefix unless the prefix ends in / (i.e., it can only list by folder). This is why it's so slow.
s3s3mirror (and many similar tools) parallelizes the copying. I don't think it (or any other tool) parallelizes the listing, because that requires a priori knowledge of how the objects are named. However, it does support prefixes, and you can invoke it multiple times, e.g., once per letter of the alphabet (or whatever split is appropriate).
You can also roll-your-own using the AWS API.
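As a sketch of the roll-your-own route using the low-level s3api commands: each listing call runs under its own prefix, so several can run concurrently. The bucket name and prefixes below are placeholders:

```shell
# Concurrently list three disjoint key prefixes with the low-level API;
# the CLI pages through the results under each prefix automatically
for prefix in a b c; do
  aws s3api list-objects-v2 \
    --bucket source-bucket \
    --prefix "${prefix}" \
    --query 'Contents[].Key' \
    --output text > "keys-${prefix}.txt" &
done
wait
```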
Lastly, the aws s3 sync command itself (and any tool for that matter) should be a bit faster if you launch it in an instance in the same region as your S3 bucket.
As explained in the recent (May 2020) AWS blog post titled:
Replicating existing objects between S3 buckets
One can also use S3 replication for existing objects. This requires contacting AWS Support to enable the feature:
Customers can copy existing objects to another bucket in the same or different AWS Region by contacting AWS Support to add this functionality to the source bucket.
I used AWS DataSync to migrate 95 TB of data. It took about 2 days. It has all these fancy features for network optimization and parallelization of jobs. You can even run checks on the source and destination to be sure everything transferred as expected.
https://aws.amazon.com/datasync/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
I'm one of the developers of Skyplane, which can copy data across buckets at over 110X the speed of cloud CLI tools. You can sync two buckets with:
skyplane sync -r s3://bucket-1/ s3://bucket-2/
Under the hood, Skyplane creates ephemeral VM instances which parallelize syncing the data across multiple machines (so you're not bottlenecked by disk bandwidth).
Copying an S3 bucket to another bucket is too slow using the following command:
aws s3 cp s3://bucket1 s3://bucket2 --recursive
I am using the AWS CLI, but this is too slow because my bucket1 contains too many files: only 3,000,000 images were copied in 12 hours.
Looking for recommendations to help expedite the copy process.
You can use this script with multi-threads:
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py
As suggested before, the methods mentioned above are the recommended practices for transferring a huge amount of data from source to destination.
S3 Optimized Transfer
Additionally, you can enable the Amazon S3 Transfer Acceleration feature on the bucket (it incurs extra charges), which can substantially accelerate your transfer speed.
Amazon S3 Transfer Acceleration
Is there a way to copy all files from S3 to an EBS drive belonging to an EC2 instance (which may belong to a different AWS account than the S3 buckets)?
We are performing a migration of the whole account, upgrading the instances from t1 to t2 types, and would like to back up the data from S3 somewhere outside S3 (and outside Glacier, since Glacier is closely linked to S3) in case something goes wrong and we lose the data.
I found only articles and docs talking about EBS snapshots, but I am not sure whether S3 data can actually be copied to EBS (in some way other than manually).
According to the docs, I can SSH into my instance and copy the data from S3 buckets to my local EBS drive, but I have to specify the bucket name. Is there a way to copy all the buckets?
aws s3 sync s3://mybucket .
I would like to achieve this:
Pseudocode:
for each bucket
do
aws s3 sync s3://bucketName bucketName
endfor
Is there a way how to do this using the AWS CLI?
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year and achieves this by automatically replicating the data you put into a bucket across 3 separate facilities (think datacenters), across Availability Zones, within a Region. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.
If you are still concerned about losing your data, you may consider copying the contents of the buckets into new buckets set up in another region. That means that you have your data in 1 system that offers 11x9's with a copy in another system that offers 11x9's. Say your original buckets reside in the Dublin region, create corresponding 'backup' buckets in the Frankfurt region and use the sync command.
eg.
aws s3 sync s3://originalbucket s3://backupbucket
That way you will have six copies of your data in six different facilities spread across Europe (naturally this is just as relevant if you use multiple regions in the US or Asia). This is a much more redundant configuration than pumping the data into EBS volumes, which have a meagre (compared to S3) 99.999% availability. The economics are also better, with S3 rates lower than EBS (1 TB in S3 = US$30 vs 1 TB in EBS (Magnetic) = US$50), and you only pay for the capacity you consume, whereas EBS is billed on what you provision.
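With the example bucket names above, the setup could be spelled out as follows; the region is illustrative:

```shell
# Create the backup bucket in a second region, then mirror into it
aws s3 mb s3://backupbucket --region eu-central-1
aws s3 sync s3://originalbucket s3://backupbucket
```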
Happy days...
References
http://aws.amazon.com/s3/faqs/
http://aws.amazon.com/ebs/faqs/
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
I would agree with rdp-cloud's answer, but if you insist on creating EBS backups, to answer your question: there is no single AWS CLI command that syncs all available buckets in one go. You can use a bash script to get the list of all buckets and then sync them in a loop:
#!/bin/bash
# List all bucket names (3rd column of `aws s3 ls` output)
BUCKETS=($(aws s3 ls | awk '{print $3}'))
for (( i=0; i<${#BUCKETS[@]}; i++ ))
do
    aws s3 sync "s3://${BUCKETS[$i]}" <destination>
done
Make sure to test that aws s3 ls | awk '{print $3}' gives you the exact list you intend to sync before running the above.