Fastest way to download large files from AWS EC2 EBS - amazon-web-services

Suppose I have a couple of terabytes worth of data files that have accumulated on an EC2 instance's block storage.
What would be the most efficient way of downloading them to a local machine? scp? ftp? nfs? http? rsync? Going through an intermediate s3 bucket? Torrent via multiple machines? Any special tools or scripts out there for this particular problem?

As I did not really receive a convincing answer, I decided to run a small benchmark myself. Here are the results I got:
More details here.

Please follow these rules:
Move the data as one file: tar everything into a single archive.
Create an S3 bucket in the same region as your EC2 instance/EBS volume.
Use the AWS CLI S3 commands to upload the archive to the S3 bucket.
Use the AWS CLI to pull the file down to your local machine or whatever other storage you use.
This will be the easiest and most efficient way for you; a sketch of the commands follows.
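A minimal sketch of those steps, assuming the AWS CLI is configured on both ends; the paths, archive name and bucket name below are placeholders:
# On the EC2 instance: pack everything into a single archive
tar -czf data.tar.gz /mnt/data
# Upload to an S3 bucket created in the same region as the instance
aws s3 cp data.tar.gz s3://my-transfer-bucket/data.tar.gz
# On the local machine: pull the archive down
aws s3 cp s3://my-transfer-bucket/data.tar.gz .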

Some more information about this use case would help, but I hope the concepts below are useful:
HTTP - fast, easy to set up, versatile, and has little overhead (a quick sketch follows this list).
Resilio (formerly BitTorrent Sync) - fast, easy to deploy, decentralized, and secure. It can handle transfer interruptions and works even if both endpoints are behind NAT.
rsync - an old-school and well-known solution. It can resume transfers and is fast at syncing large amounts of data.
Upload to S3 and fetch from there - uploading to S3 is fast, and you can then use HTTP(S) or BitTorrent to get the data locally.
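A minimal sketch of the HTTP option, assuming Python 3 is available on the instance, the security group allows inbound traffic on port 8000, and ec2-host stands in for the instance's public DNS name (note that plain HTTP is unencrypted):
# On the EC2 instance: serve the data directory over HTTP
cd /mnt/data && python3 -m http.server 8000
# On the local machine: download with resume support
wget -c http://ec2-host:8000/data.tar.gz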

Related

File transfer from basespace to aws S3

I use the Illumina Basespace service to do high-throughput sequencing secondary analyses. This service uses AWS servers, and therefore all files are stored on S3.
I would like to transfer the files (the analysis results) from Basespace to my own AWS S3 account. I would like to know the best strategy to make things go quickly, knowing that in the end it comes down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me.
The solutions I'm thinking of:
use the Basespace CLI tool to copy the files to our on-premises servers, then transfer them back to AWS
use this tool from an EC2 instance.
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket?).
If I use an EC2 instance, what kind of instance do you recommend so it has enough resources without being oversized (and therefore wasting money)?
Thanks in advance,
Quentin
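Regarding the pre-signed URL idea, here is a minimal sketch of how it could work, assuming the URL has already been obtained from the Illumina API and the command runs on an EC2 instance in the same region as the destination bucket; the URL, bucket and key names are placeholders:
# Stream the file from the pre-signed URL straight into S3 without
# writing it to local disk
curl -sL "https://presigned-url.example/result.bam" | aws s3 cp - s3://my-bucket/results/result.bam
Because nothing touches the local disk, the instance mainly needs network bandwidth rather than CPU or storage, so a small instance may well be sufficient.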

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-premises to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the data size is about 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you wouldn't run into trouble. I don't know whether your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted you should be fine; if you're worried, you can encrypt your files beforehand and they will remain encrypted while stored (useful if it's a backup). To get to the point: you can use API tools for the transfer, or file-explorer-type tools that also have S3 connectivity (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx).
One other point: the cost-effectiveness of storage and transfer depends on how frequently you need the data; if it's just a backup or a just-in-case copy, archiving to Glacier is much cheaper.
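A minimal sketch of the encrypt-before-upload idea, assuming GnuPG is installed; the archive and bucket names are placeholders:
# Encrypt locally with a passphrase before the data leaves your network
gpg --symmetric --cipher-algo AES256 backup.tar
# Upload the encrypted archive (backup.tar.gpg), additionally asking S3 for server-side encryption
aws s3 cp backup.tar.gpg s3://my-bucket/backup.tar.gpg --sse AES256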
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using the AWS CLI, you can copy files from your local machine to S3.
AWS Transfer using FTP or SFTP (AWS SFTP); please refer to the AWS documentation for setup details.
There are tools like the CloudBerry client, which have a UI for S3.
You can use the AWS DataSync tool.

Amazon equivalent of Google Storage Transfer Service

I have a bucket in GCP that has millions of 3 KB files, and I want to copy them over to an S3 bucket. I know Google has a super fast transfer service; however, I am not able to use that solution to push data back to S3.
Due to the amount of objects, running a simple gsutil -m rsync gs://mybucket s3://mybucket might not do the job because it will take at least a week to transfer everything.
Is there a faster solution than this?
On the AWS side, you may want to see if S3 Transfer Acceleration would help. There are specific requirements for enabling it and naming it. You would want to make sure the bucket was in a location close to where the data is currently stored, but that might help speed things up a bit.
We hit the same problem pushing large numbers of small files to S3; compressing them and storing the archive back runs into the same thing. The bottleneck is the request-rate limits set on your account.
As mentioned in the documentation, you need to open a support ticket to increase your limits before you send a burst of requests:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
It is NOT the size of each file or the total size of all objects that matters here; the number of files you have is the problem.
Hope it helps.
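For what it's worth, that page (at the time) also suggested spreading keys across many prefixes when sustaining very high request rates. A rough, hypothetical sketch of what that could look like, with placeholder names:
# Derive a short hash prefix from the key so objects spread across many prefixes
key="data/file-000001.bin"
prefix=$(echo -n "$key" | md5sum | cut -c1-4)
aws s3 cp "$key" "s3://mybucket/${prefix}/${key}"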
Personally I think the main issue that you're going to have is not so much the ingress rate to Amazon's S3 service but more so the network egress rate from Google's Network. Even if you enabled the S3 Transfer Acceleration service, you'll still be restricted by the egress speed of Google's Network.
There are other services that you can set up which might assist in speeding up the process. Perhaps look into one of the Interconnect solutions which allow you to set up fast links between networks. The easiest solution to set up is the Cloud VPN solution which could allow you to set up a fast uplink between an AWS and Google Network (1.5-3 Gbps for each tunnel).
Otherwise, given your data requirements, transferring 3,000 GB isn't a terrible amount of data, and setting up a cloud server to move it over the course of a week isn't too bad. You might find that, by the time you have set up another solution, it would have been easier in the first place to just spin up a machine and let it run for a week.
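If the week-long single stream is the main concern, one rough sketch, assuming AWS credentials are configured in gsutil's Boto config and the keys can be split by prefix, is to shard the rsync and run the shards in parallel:
# Run one rsync per key prefix in parallel (the prefixes are placeholders)
for p in 00 01 02 03; do
  gsutil -m rsync -r "gs://mybucket/$p" "s3://mybucket/$p" &
done
wait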

Uploading Directly to S3 vs Uploading Through EC2

I'm developing a mobile app that will use AWS for its backend services. In the app I need to upload video files to S3 on a frequent basis, and I'm wondering what the recommended architecture would look like to make this scalable and efficient. Traffic could be high, and file sizes could be large.
-On one hand, I could upload directly to S3 using the S3 API on the client side. This would be the easiest option, but I'm not sure of the negative implications associated with it.
-The other way to do it would be to go through an EC2 instance, handle the request with some PHP scripts, and upload from there.
So my question is... Are these two options equal, or are there major drawbacks to one of them as opposed to the other? I will already have EC2 instances configured for database access, if that makes any difference in how you approach the question.
I would recommend uploading directly to S3 using the S3 API on the client side, as you can speed up the upload by using S3 multipart upload, given that your video files are going to be large.
The second method will put extra CPU load on your EC2 instance, as both the script processing and the re-upload to S3 consume CPU.
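As a rough illustration of the multipart point (the mobile SDKs expose the same mechanism), the AWS CLI splits large uploads into parts automatically, and its thresholds can be tuned; the values and names below are placeholders:
# Tune the CLI's automatic multipart behaviour for large files
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_concurrent_requests 20
# The upload is then split into parts and sent in parallel
aws s3 cp video.mp4 s3://my-bucket/uploads/video.mp4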

what is the best way to download the contents of an AWS EBS volume?

I have a number of large (100 GB-400 GB) files stored on various EBS volumes in AWS. I need local copies of these files for offline use. I am wary of attempting to scp files of that size down from AWS. I've considered cutting the files up into smaller pieces and reassembling them once they all arrive successfully, but I wonder if there is a better way. Any thoughts?
There are multiple ways, here are some:
Copy your files to S3 and download them from there. S3 has a lot more support in the backend for downloading files (It's handled by Amazon)
Use rsync instead of scp. rsync is a bit more reliable than scp, and with --partial it can resume interrupted downloads.
rsync -azv --partial --progress remote-ec2-machine:/dir/iwant/to/copy /dir/where/iwant/to/put/the/files
Create a private torrent for your files. If you're using Linux, mktorrent is a good utility for this: http://mktorrent.sourceforge.net/
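A minimal sketch of that last option, where the tracker URL and file names are placeholders (-p marks the torrent as private):
# Create a private torrent for the archive; only clients that have the
# .torrent file and can reach the tracker can download it
mktorrent -p -a http://tracker.example.com/announce -o backups.torrent /mnt/ebs-data/backups.tar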
Here is one more option you can consider if you want to transfer large amounts of data:
AWS Import/Export is a service that accelerates transferring data into and out of AWS using physical storage appliances, bypassing the Internet. AWS Import/Export Disk was originally the only service offered by AWS for data transfer by mail. Disk supports transferring data directly onto and off of storage devices you own, using Amazon's high-speed internal network.
Basically, from what I understand, you send Amazon your HDD, they copy the data onto it for you, and they send it back to you.
As far as I know this is only available in the USA, but it might have been expanded to other regions.