download a large number of files from EC2 - amazon-web-services

I have lots (10 million) of files (some 20K folders, each folder with about 500 files) on an EC2 EBS drive of 1TB.
I'd like to download them to my PC; how would I do that most efficiently?
Currently I'm using rsync, but it's taking ages (about 3 MB/s, when my ISP connection is 10 MB/s).
Maybe I should use some tool to send it to S3 and then download it from there?
How would I do that, while preserving the directory structure?

The most efficient way would be to have a physical disk/drive shipped there and back. Even today, for large sizes (>= 1 TB), snail mail is the fastest and most efficient way to move data back and forth:
http://aws.amazon.com/importexport/

S3 and parallel HTTP downloads can help, but you can also use other download acceleration tools directly from your EC2 instance, such as Tsunami UDP or Aspera.
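If you do go the S3 route, the directory structure is preserved automatically as long as the object keys mirror the relative file paths (which is what "aws s3 sync" does on upload). On the download side, a minimal Python/boto3 sketch of a parallel fetch might look like the following - the bucket name, prefix and destination are placeholders, and 32 workers is just a starting point to tune:

    # Sketch: parallel download from S3, recreating the key structure as local folders.
    # Assumes the EBS data has already been copied up to the bucket (e.g. with "aws s3 sync").
    import os
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    BUCKET = "my-backup-bucket"   # placeholder bucket name
    PREFIX = ""                   # optional key prefix to limit the listing
    DEST = "/data/download"       # local destination root

    s3 = boto3.client("s3")

    def fetch(key):
        local_path = os.path.join(DEST, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)

    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):   # skip folder placeholder objects
                keys.append(obj["Key"])

    # With millions of tiny files, concurrency matters far more than raw bandwidth.
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(fetch, keys))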

Related

Efficient way to upload huge number of small files in S3

I'm encoding DASH streams locally that I intend to stream through CloudFront afterwards, but uploading the whole folder gets counted as over 4,000 PUT requests. So I thought I could instead compress it and upload the zip archive, which would count as only 1 PUT request, and then unzip it using Lambda.
My question is: will Lambda still incur the PUT requests when unzipping the file? And if so, what would be a better/more cost-effective way to achieve this?
No, there is no way around having to pay for the individual PUT/POST requests per-file.
S3 is expensive. So is anything related to video streaming. The bandwidth and storage costs will eclipse your HTTP request costs. You might consider a more affordable provider; AWS is the most expensive of the providers that offer S3-compatible hosting.
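For what it's worth, the unzip-in-Lambda approach would look roughly like the sketch below (the bucket layout and key names are made up). Note the put_object call inside the loop - that is exactly the per-file PUT you were hoping to avoid, so the zip only saves you the upload requests from your own machine, not the requests to write each extracted object:

    # Sketch of a Lambda handler that unzips an uploaded archive back into S3.
    # Every extracted entry still becomes its own PutObject call, i.e. one PUT per file.
    import io
    import zipfile
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        record = event["Records"][0]["s3"]          # standard S3 trigger event shape
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]               # e.g. "uploads/segments.zip" (made up)

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                if name.endswith("/"):
                    continue                        # skip directory entries
                s3.put_object(Bucket=bucket,
                              Key="unzipped/" + name,
                              Body=archive.read(name))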

How much would AWS ec2 cost for a project of my type

I have tried many times to install RStudio Server on an AWS instance using terminal commands, without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and by following a YouTube video, but I cannot get the Dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and PuTTY and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents, which could take 2 weeks. I had hoped to save these to a large Dropbox account I have access to, but unfortunately
    library("RStudioAMI")
    linkDropbox()
    excludeSyncDropbox("*")
doesn't seem to work for me; the whole Dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget Dropbox and just use AWS storage.
I want to download approx 500 GB - or perhaps 1 TB - worth of data (running an R script to download documents and save them). It just connects to a website and downloads documents, so no ML or high computing power is needed, just a consistent connection. Once the documents are fully downloaded, I would like to transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost? I don't care about paying $20-30, I just don't want to go in with inexperience/without knowledge and rack up hundreds of dollars.
Additionally: what other instances/servers do you suggest I pay for? I feel like I don't need that much power, just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
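As a rough back-of-the-envelope check of those numbers (using the approximate prices quoted above, not current list prices):

    # Two weeks of a small instance, 1 TB of gp2 storage, and 1 TB of data transfer out
    hours = 14 * 24
    ec2 = hours * 0.02        # t3.small at ~2c/hour                 -> ~$6.72
    ebs = 1000 * 0.10 / 2     # 1 TB at 10c/GB-month, half a month   -> ~$50
    transfer = 1000 * 0.09    # 1 TB out at 9c/GB                    -> ~$90
    print(ec2 + ebs + transfer)   # roughly $147 in total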
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 (which I think you are not, considering the requirements you stated and that the AMI approach failed for you), AWS Lightsail would be a much better solution.
It has a bundled data transfer allowance and acceptable performance.
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer (data in costs nothing, only data out, e.g. from Lightsail to your local PC)
Additional SSD - $10 for 1 TB
The average network performance I see for that instance is about 30 megabytes per second. You can just shut everything down and only be billed for the hours you used during the month.
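As a rough sanity check on that rate, pulling the full data set down at ~30 MB/s sustained would take well under a day:

    # 1 TB at ~30 MB/s sustained
    seconds = 1000 * 1000 / 30        # ~33,000 s
    print(seconds / 3600)             # roughly 9-10 hours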

I have a 50 GB VDI image, what's the best way to host it for distribution?

I have a virtual machine disk and need to host it for distribution within my company. What's the best way to do so?
Can I put it on AWS S3?
Yes, you can host your file on S3. However, there are several factors to consider.
Cost:
Data transfer out (download) pricing on S3 is approximately $0.09 per GB. This translates to about $4.50 for each download of the 50 GB image.
Transfer Reliability:
Downloading a 50 GB file may be problematic for some customers. I would use a zip tool and split the file into 2-5 GB parts to make downloading easier. You don't want to pay for a 47 GB download that failed, forcing the user to start the download over again.
Security:
If you make the download file public, anyone can download it and you pay. Use presigned URLs or signed cookies to control who can download the file (see the sketch at the end of this answer). Since the file is for internal company use, you can also restrict access to IP (CIDR) block ranges.
Performance:
Where are your users located? This will help you determine where to host the bucket(s) for downloads. You might consider CloudFront, but for files this large, they won't get cached unless you split the file into smaller pieces.
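For reference, generating a presigned URL takes only a few lines with boto3 - the bucket and key below are placeholders, and the expiry is up to you:

    # Sketch: create a time-limited download link so the object itself stays private.
    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-internal-images", "Key": "vm/disk-part-01.zip"},
        ExpiresIn=3600,   # link is valid for one hour
    )
    print(url)            # hand this URL to whoever needs the download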

Sync data between EC2 instances

While I'm looking to move our servers to AWS, I'm trying to figure out how to sync data between our web nodes.
I would like to mount a disk on every web node and have a local cache of the entire share.
Are there any preferred ways to do this?
It sounds like you should consider storing your files on S3 in the first place and, if performance is key, having a sync job that pulls copies of the files locally to your EC2 instances. S3 is fast, durable and cheap - maybe even fast enough without keeping a local cache - but if you do indeed need a local copy, there are tools such as the AWS CLI and other third-party tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
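If you prefer to script it rather than shell out to the CLI, a pull-based cache refresh along the same lines as "aws s3 sync" could be sketched with boto3 like this (the bucket name and cache path are placeholders); run it from cron or a systemd timer on each web node:

    # Sketch: copy down anything in the bucket that is missing locally or has changed size.
    import os
    import boto3

    BUCKET = "shared-assets"        # placeholder bucket shared by the web nodes
    CACHE = "/var/cache/assets"     # local cache directory on each node

    s3 = boto3.client("s3")

    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key, size = obj["Key"], obj["Size"]
            if key.endswith("/"):
                continue                            # skip folder placeholder objects
            local_path = os.path.join(CACHE, key)
            if os.path.exists(local_path) and os.path.getsize(local_path) == size:
                continue                            # already cached and unchanged
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, key, local_path)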
Depending on what you are trying to sync - take a look at
http://aws.amazon.com/elasticache/
It is an extremely fast and efficient method for sharing data.
One really easy solution is to install the Dropbox sync client on both machines and keep your files in Dropbox. This is by far the easiest!
With this approach, you can "load" data onto the machines by adding files to your Dropbox account externally (without even going through an AWS service) - from another machine or even from the Dropbox browser interface.

Transferring lots of small files between EC2 and Amazon S3

I'm building a browser game and I have a lot of small files that need to be transferred between my EC2 instance and S3 when players perform some key actions.
Although transferring a single big file is fairly fast, transferring multiple small files is extremely slow. I'm using Amazon's PHP SDK.
Is there a way to overcome this weakness in S3? Thanks.
It looks like combining the two solutions below is the way to go.
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
http://gearman.org/
If this transfer has to be made from an EC2 instance to S3, then maybe you can try using s3fuse, which basically mounts your S3 bucket as a storage volume on the EC2 instance.
The performance of S3 is not constant and can be quite slow sometimes. If you need real-time performance for a shared object, I would take a look at the AWS memcached service (ElastiCache), although I have not used it.
How exactly are you uploading the files? Is there a multithreaded method in the SDK? I'm asking because I've had to implement my own method for downloading things faster than the SDK.
Do you need to read those files right away? How many events do you have per second, and do you need them ordered?
My first thought would be to make a local buffer that uploads batches every once in a while.
Then, if that's too slow, I'd store them in a fast buffer first, instead of S3, and flush it every once in a while. My choices would be simple things like SQS or Redis. SQS has theoretically unlimited throughput for standard queues and 300 batches per second (1 batch = 1..10 messages = 0..256 KB) for FIFO queues - which you can increase further.
Then you have streams, Lambda and whatever.
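If you go the SQS route, the buffering idea might look something like the sketch below - the queue URL and event shape are placeholders, and the same SendMessageBatch API is available from the PHP SDK:

    # Sketch: buffer small payloads in memory and flush them to SQS in batches of 10,
    # instead of making one S3 PUT per tiny file.
    import json
    import uuid
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/game-events"  # placeholder
    sqs = boto3.client("sqs")
    buffer = []

    def record_event(event):
        buffer.append(event)
        if len(buffer) >= 10:          # 10 messages is the SQS batch limit
            flush()

    def flush():
        entries = [{"Id": str(uuid.uuid4()), "MessageBody": json.dumps(e)}
                   for e in buffer]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        buffer.clear()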