What is the cheapest way to allow others to download a dataset I have?

I have some datasets (possibly up to 10 GB zipped altogether) for my machine learning applications.
In order to expose these datasets to others, I believe I have to host a server and let others download them over the network.
What is the cheapest server I can use for this? (I checked the AWS free tier; can it be used?)
Do I need to write a web server, or is there a premade tool I can use for my use case?

You haven't indicated how much data will be downloaded (GB/month), and that's important because you pay for data transfer out to the internet (about $0.09 per GB) beyond an initial free amount (1 GB/month, I believe, but check whether the free tier offers more). That applies to both S3 and EC2.
That said, I'd consider a few options.
1. Storing the files in S3 and serving them from S3 via CloudFront may be cheaper than running a server 24x7 to host and serve the files (see the sketch after this list).
2. A small EC2 server that fits into the free tier usage plan, running a web or FTP server, serving up your files.
3. Similar to #1, but you also configure requester pays for S3 downloads. This option requires your downloaders to have AWS credentials and for you to manage their access. May not be feasible in your case.
4. Create an EBS volume containing your data, take a snapshot of that volume, and share the snapshot with other AWS accounts, then shut down your EC2 instance. This option requires your users to be AWS account holders and to share their AWS account numbers with you. May not be feasible in your case.
5. AWS SFTP serving up data stored in S3.
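If you go with option 1 or 3, the S3 side is only a couple of API calls. Here is a minimal sketch using Python and boto3; the bucket name and object key are placeholders, and you'd still want CloudFront in front of the bucket for option 1:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"  # hypothetical bucket name

# Option 3: make requesters pay for the data transfer out of S3.
s3.put_bucket_request_payment(
    Bucket=BUCKET,
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Option 1: hand out time-limited download links without making
# the bucket public (this link expires after 24 hours).
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": "datasets/images-v1.zip"},
    ExpiresIn=24 * 3600,
)
print(url)
```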

Related

S3 AWS download Speed

I'm having trouble with downloading files from S3. If I download a file of about 200MB and then start downloading another file, the second download is really slow, around 40KB/s.
And even when the first download finishes, the second one continues at 40KB/s...
Any ideas about that?
Amazon S3 has huge bandwidth.
If you are downloading from Amazon S3 to your own computer (outside of AWS), then the only limitations that would impact you are your own Internet bandwidth, and any speed limitations imposed within your own network.
I will presume that you are downloading an object from Amazon S3 to an Amazon EC2 instance in the same region as the S3 bucket.
In this scenario, the only bandwidth limitation is the Network Performance on the Amazon EC2 instance. Basically, the bigger the instance, the more bandwidth is available.
In the Launch screen, a t2.large is listed as Low to Moderate Network Performance. This is reasonably good, but not as good as larger instance types.
See:
Amazon EC2 Instance Configuration - Amazon Elastic Compute Cloud
EC2 Network Performance Cheat Sheet | cloudonaut
It might also be a result of the software you are using to download the files and how it multi-tasks and shares network bandwidth between the downloads.
If your laptop is connected to the Internet via a Wi-Fi router, try a wired connection instead.
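If the slowdown is on the client side, it's worth checking how your download tool parallelizes transfers. As a point of comparison, here is a minimal sketch using Python and boto3, whose transfer manager splits large objects into concurrent ranged GETs; the bucket and key names are placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Download large objects in 16 MB parts, up to 10 parts in parallel.
config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=10,
)

s3.download_file(
    "my-bucket",          # placeholder bucket
    "big-file.bin",       # placeholder key
    "/tmp/big-file.bin",  # local destination
    Config=config,
)
```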

What AWS Service should I use?

We currently run in-house hardware that we would like to potentially move to AWS. Our main application uses MySQL on a Linux machine (200GB disk, 32GB RAM, 4 cores), serving content to customers through a hardware load balancer (around 1 million unique users per month).
We also use a 500 GB CDN hosted by a third party that we would like to move to AWS potentially. What AWS services would you recommend we look at to achieve comparable functionality and would you have a rough monthly cost estimate?
The main reason we would like to move to AWS would be for cost reduction in hardware and better backup strategies.
Thanks!
1. You can host your application on two or more EC2 instances and use an Elastic Load Balancer to distribute load amongst those instances.
2. You could use Amazon Aurora MySQL (Serverless), which offers a pay-as-you-go model that will minimise your cost. It is a very good option for your MySQL database, since your user count, and therefore the load on the database, is high.
3. For the CDN you could go with AWS CloudFront and S3. It offers high availability for your application at a lower cost; you just need some proper configuration and you are ready to go.
AWS is a good cloud option for you, as it provides a service for each of your problems, so you can pick services according to your use case and make the most of them. It also offers flexible pricing.
Please go through the AWS docs and pricing before you choose AWS.
Comparable functionality would be to use AWS RDS to replace your MySQL database, one or more EC2 instances to run your application, and an AWS load balancer to distribute the load amongst those EC2 web instances, plus a combination of S3 and CloudFront to use as a CDN.
Cost is going to depend on how many EC2 instances you use, the size and options you choose for the RDS database(s), and storage and bandwidth, so it's impossible for me to estimate for you.
But you can do your own estimates here: https://awstcocalculator.com/
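To give a feel for the EC2 + load balancer part, here is a hedged sketch using Python and boto3 that creates a target group and registers two instances with it. The VPC ID, instance IDs, and names are placeholders, and in practice you'd also create the load balancer itself and a listener:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Create a target group for the web tier (placeholder VPC ID).
tg = elbv2.create_target_group(
    Name="web-tier",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Register the EC2 instances that run the application (placeholder IDs).
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "i-0aaaaaaaaaaaaaaaa"}, {"Id": "i-0bbbbbbbbbbbbbbbb"}],
)
```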

Best AWS setup for a dedicated FTP server?

My company is looking for a solution for file sharing via FTP - currently, we share one server for client/admin FTP file sharing and serving multiple sites, and are looking to split off our roles so that we have one server dedicated to FTP and one for serving websites.
I have tried to find a good solution with AWS, but cannot find any detailed information about EBS and EC2, and whether an EC2 instance type will be able to handle FTP storage. For example, a t2.nano instance seems ideal with 1 vCPU and minimal RAM, but I see no information regarding EBS storage limits.
We need around 500GiB at most, and will have transfers happening daily in the neighborhood of 1GiB in and out. We don't need to run a database or http server. We may run services for file cleanup in the background weekly.
EDIT:
I mis-worded the question, owing to a fundamental misunderstanding of AWS EC2 and EBS which I now grasp. I know EC2 can run FTP services; the question was more about a cost-effective solution with dynamic storage. Thanks for the input!
As others here on SO will tell you: don't bother with EBS. It can be made to work but does not make much sense in the long run. It's also more expensive and trickier to operate (backups/disaster recovery/having multiple ftp server machines).
Go with S3 for storing your files, and use something that can expose S3 over FTP (like s3fs).
See:
http://resources.intenseschool.com/amazon-aws-howto-configure-a-ftp-server-using-amazon-s3/
Setting up FTP on Amazon Cloud Server
http://cloudacademy.com/blog/s3-ftp-server/
If FTP is not a hard requirement, you can also look at migrating people to using S3 directly (either from the start, or after you do the FTP setup and give them the option of both FTP and direct S3).
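If your users can run a script instead of an FTP client, direct S3 access is only a few lines. A minimal sketch with Python and boto3 (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "company-file-share"  # placeholder bucket

# List what's available under a client's folder, then fetch a file.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="client-a/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file(BUCKET, "client-a/report.pdf", "report.pdf")
```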
This question is among the most viewed on SO for AWS: you can install an FTP server on any EC2 instance type.
There's no practical limit with EBS, and you can always increase a volume's size when you need to (see the sketch below), so the best rule is: start small and grow when needed.
The only point to mention is that network performance comes with the instance type, so if you care about speed, a t2.nano (low network performance) might not be sufficient.
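Growing an EBS volume later is a single API call plus a filesystem resize on the instance. A sketch with Python and boto3, assuming a hypothetical volume ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Grow the volume from its current size to 500 GiB (placeholder ID).
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=500)

# After the modification completes, extend the filesystem on the
# instance itself, e.g. with growpart and resize2fs/xfs_growfs.
```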

Best setup to work with amazon AWS

I have a website which takes backups from different social media services, stores the data on a server, and then displays it on my website. The content includes videos, images, and text data.
Currently I am using an EC2 instance with RDS and EBS. Data is stored in EBS volumes, but the amount of data is big, more than 1 TB, and increasing. Every time my EBS volume fills up, I attach another volume.
Then I added S3 to my setup. Cron jobs run and store data on S3, and the EC2 instance serves data from S3. I am using the PHP SDK for this purpose.
The problem I am facing is that S3 is very slow in my current setup.
Please suggest whether my setup is good, whether I need some change in my setup, and how I can speed up S3, or whether I should take some other approach.
The EC2 instance is a large reserved instance running CentOS.
I have heard about s3fs, which mounts an S3 bucket to EC2 as a volume. Is this a good choice? When I mounted an S3 bucket to the EC2 instance, the transfer rate was very slow.
I am new to AWS. My users do not access files directly from S3; they access them through my website, which runs on the EC2 instance.
RDS is a good choice for storing metadata such as tags, comments and other relevant information about your multimedia files. S3 is good for storing static content such as Video, Audio and Pictures. I think your approach with RDS and S3 is good enough.
EBS backed instances are good for persistence. If you store your metadata on RDS and static content on S3, the only reason why you should use EBS backed EC2 instances is that you have some configuration files which are unversioned right now. If that's not the case, assuming that your configuration is checked into version control and can be pulled on-demand for a fresh instance every time, then you might want to ditch EBS volumes in favor of ephemeral storage. That may give you some performance boost, nothing significant though.
Regarding your concern with S3's latency: yes, S3 is slow. While all your writes may happen directly to S3, I would highly recommend that you set up Amazon CloudFront for your S3 buckets and let your website consume multimedia content from CloudFront. CloudFront is a Content Delivery Network (CDN) which works with disk volumes (EBS-backed or ephemeral) as well as with S3, and setting it up takes no more than a few minutes. CloudFront also supports streaming media files over RTMP; you may need a library like GPAC for hinting multimedia files to make them streamable, if that's not being done already. You might then want to consider creating one distribution for video/audio files for streaming and another distribution for images, JavaScript, stylesheets, and other text files.
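To make that concrete: writes go straight to S3, while the site links to the CloudFront distribution instead of the bucket. A minimal sketch in Python with boto3; the bucket name and distribution domain are placeholders for whatever your CloudFront console shows:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "media-originals"               # placeholder bucket
CDN_DOMAIN = "d1234abcd.cloudfront.net"  # placeholder distribution domain

# Upload the original to S3 with a long cache lifetime, since
# CloudFront will cache it at the edge.
key = "videos/intro.mp4"
s3.upload_file(
    "intro.mp4", BUCKET, key,
    ExtraArgs={"CacheControl": "max-age=86400"},
)

# The website should embed the CloudFront URL, not the S3 URL.
print(f"https://{CDN_DOMAIN}/{key}")
```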
Hope this helps.
For faster downloading and uploading of files with Amazon S3, I use batch().
You can also use CloudFront for faster delivery; I think 9gag uses CloudFront too.
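For many small files, the batching idea can be reproduced with a thread pool; a hedged sketch in Python with boto3 (the bucket and keys are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "media-originals"  # placeholder bucket

keys = ["img/a.jpg", "img/b.jpg", "img/c.jpg"]  # placeholder keys

# Fetch several objects concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    for key in keys:
        pool.submit(s3.download_file, BUCKET, key, key.split("/")[-1])
```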

need some guidance on usage of Amazon AWS

Every once in a while I read or hear about AWS, and now I've tried reading the docs.
But the docs seem to be written for people who already know which AWS service they need and are only looking up how to use it.
So, to understand AWS better, I'll sketch a hypothetical web application and ask a few questions.
The app's purpose is to modify content like videos or images. A user has some kind of web interface where he can upload his files and choose some settings, and a server grabs the file and modifies it (e.g. re-encoding). The service also extracts the audio track of a video and tries to index the spoken words so the customer can search within his videos. (Well, it's just hypothetical.)
So my questions:
Given my own domain 'oneofmydomains.com', is it possible to host the complete web interface on AWS? I thought about using GWT to create the interface and just delivering the JS/images via AWS, but via which service, Simple Storage? And what about some kind of index.html: is an EC2 instance needed to host a web server that has to run 24/7, causing costs?
Now the user has the interface with a login form. Is it possible to manage logins with an AWS service? Here I also think of an EC2 instance hosting a database, but that would also cause costs, and I'm not sure if there is a better way.
The user has logged in and uploads a file. Which storage solution could be used to save the customer's original and modified content?
Now the user wants to browse the status of his uploads, which means I need some kind of ACL so that the customer only sees his own files. Do I need a database (e.g. on EC2) for this, or does Amazon provide some kind of ACL, so that the GWT web interface stays secure without any EC2?
The customer's files are re-encoded and the audio track is indexed, and now he wants to search for a video. Which service could be used to create and maintain the index for each customer?
Hope someone can give a few answers so that I understand better how one could use AWS.
Thanks!
Amazon AWS offers a whole ecosystem of services which should cover all aspects of a given architecture, from hosting to data storage, or messaging, etc. Whether they're the best fit for purpose will have to be decided on a case by case basis. Seeing as your question is quite broad I'll just cover some of the basics of what AWS has to offer and what the different types of services are for:
EC2 (Elastic Compute Cloud)
Amazon's cloud compute solution, which is basically the same as older virtual machine technology, but the 'cloud' offers extras such as automated provisioning, scaling, billing, etc.
you pay for what you use (by the hour); the basic instance (single CPU, 1.7GB RAM) would probably cost you just under $3 a day if you run it 24/7 (on a Windows instance, that is)
there's a number of different OSes to choose from, including Linux and Windows; Linux instances are cheaper to run since there's no license cost, unlike with Windows
once you've set up the server the way you want, including any server updates/patches, you can create your own AMI (Amazon Machine Image), which you can then use to bring up another identical instance
however, if all your HTML is baked into the image, it'll make updates difficult, so the normal approach is to include a service (a Windows service, for instance) which pulls the latest deployment package from a storage service (see S3 below) and updates the site at startup and at intervals
there's the Elastic Load Balancer (which has its own cost, but only one is needed in most cases) which you can put in front of all your web servers
there's also the CloudWatch service (again, extra cost) which you can enable on a per-instance basis to help you monitor the CPU, network in/out, etc. of your running instances
you can set up auto-scaling, which can automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization is below 50% for 5 minutes, bring up 1 instance at a time if average CPU goes beyond 70% for 5 minutes (see the sketch after this list)
you can use the instances as web servers, use them to run a DB or a Memcached cluster, etc.; the choice is yours
typically, I wouldn't recommend having Amazon instances talk to a DB outside of Amazon because the round trip is much longer; the usual approach is to use SimpleDB (see below) as the database
the AWS SDK contains enough classes to help you write a custom monitoring/scaling service if you ever need to, but the AWS console allows you to do most of the configuration anyway
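As a sketch of the auto-scaling point above, here is how a simple scale-up rule could look with Python and boto3; the group name is a placeholder, and the CloudWatch alarm supplies the "CPU above 70% for 5 minutes" trigger:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Add one instance whenever the policy fires (placeholder group name).
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="scale-up-by-one",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Fire the policy when average CPU exceeds 70% for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="web-asg-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```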
SimpleDB
Amazon's non-relational, key-value data store; compared to a traditional database you tend to pay a per-query performance penalty, but you get high scalability without having to do any extra work.
you pay for usage, i.e. how much work it takes to execute your query
extremely scalable by default; Amazon scales up SimpleDB instances based on traffic without you having to do anything (and without giving you any control over it, for that matter)
data is partitioned into 'domains' (equivalent to a table in a normal SQL DB)
data is non-relational; if you need a relational model then check out Amazon RDS; I don't have any experience with it, so I'm not the best person to comment on it
you can still execute SQL-like queries against the database, usually through some plugin or tool; Amazon doesn't provide a front end for this at the moment
be aware of 'eventual consistency': data is duplicated on multiple instances after Amazon scales up your database, and synchronization is not guaranteed when you do an update, so it's possible (though highly unlikely) to update some data, read it back straight away, and get the old data back
there are 'Consistent Read' and 'Conditional Update' mechanisms available to guard against the eventual-consistency problem (see the sketch after this list); if you're developing in .NET, I suggest using the SimpleSavant client to talk to SimpleDB
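A hedged sketch of the consistent-read guard with Python and boto3, which exposes SimpleDB through its 'sdb' client; the domain and item names are placeholders, and the domain is assumed to already exist:

```python
import boto3

sdb = boto3.client("sdb")
DOMAIN = "videos"  # placeholder domain, assumed already created

# Write an attribute, then read it back with ConsistentRead so we
# don't race the replication described above.
sdb.put_attributes(
    DomainName=DOMAIN,
    ItemName="video-123",
    Attributes=[{"Name": "status", "Value": "encoded", "Replace": True}],
)
resp = sdb.get_attributes(
    DomainName=DOMAIN, ItemName="video-123", ConsistentRead=True
)
print(resp.get("Attributes"))
```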
S3 (Simple Storage Service)
Amazon's storage service; again, extremely scalable, and safe too: when you save a file to S3 it's replicated across multiple nodes, so you get some DR capability straight away.
you pay for storage and data transfer
files are stored against a key
you create 'buckets' to hold your files, and each bucket has a unique URL (unique across all of Amazon, and therefore across all S3 accounts)
CloudBerry S3 Explorer is the best UI client I've used on Windows
using the AWS SDK you can write your own repository layer which utilizes S3 (see the sketch after this list)
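That repository layer might look like the following minimal sketch in Python with boto3 (the class name and bucket are invented for illustration):

```python
import boto3

class S3Repository:
    """Tiny repository layer that stores blobs against string keys."""

    def __init__(self, bucket: str):
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def save(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def load(self, key: str) -> bytes:
        resp = self._s3.get_object(Bucket=self._bucket, Key=key)
        return resp["Body"].read()

# Usage (placeholder bucket name):
repo = S3Repository("my-app-content")
repo.save("uploads/video-123.mp4", b"fake video bytes")
print(len(repo.load("uploads/video-123.mp4")))
```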
Sorry if this is a bit long-winded, but those are the three most popular web services Amazon provides, and they should cover all the requirements you've mentioned. We've been using Amazon AWS for some time now, and while there are still some kinks and bugs, it's generally moving forward and pretty stable.
One downside to using something like AWS is vendor lock-in. While you could run your services outside of Amazon in your own datacenter, and move files out of S3 (at a cost), getting out of SimpleDB is likely to represent the bulk of the work during a migration.