Loading large amount of data from local machine into Amazon Elastic Block Store - amazon-web-services

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with a an attached EBS and I was able to load files into it via scp on my local command line. I will have several gigabytes of data to load onto this EBS (I know that isn't a lot by ML standards but that's not really my point). I would like to know what is the appropriate way to load this data. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu#<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch jbb run overnight like this but I am not sure if it may incur some data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything on costs associated with loading data but I just wanted to confirm with someone or some people who have done something similar that this is the correct approach, and if not, what is a a better one

In managing large objects in AWS. Always check for S3 as an initial option, it provides unlimited Storage capacity and best use for object store compared to EBS(block store). EBS billed you from the size of the volume that you provisioned, having said that there is a chance that you over-provisioned(overhead cost) or under-provisioned (can lead to poor performance or even downtime).
Using S3 you are billed for the storage that you consumed per GB per month, pay for what you use model and it's very cheap compared to EBS.
And lastly, try to evaluate first the AWS Machine Learning services that might fit for your use-cases it will save you alot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details

Related

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read from millions of files and process those parallelly (later index those in ES). This is a one time operation and after processing those we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made the list
EBS: The data is retained even after EC2 instance is shut down. It is accessible from a single EC2 instance from our AWS region. It will be useful if we split the data on our own and provide it to different EC2 instances. It offers redundancy and encryption security. Easy to scale. We can use it if we divide the chunks manually and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don’t have to worry about maintaining and deploying the FS
S3: Not limited to access from EC2 but S3 is not a file system
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
Problem with small files in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Considering most of the files we receive are less then 1 MB; this can cause memory issues if we go beyond a certain number. So it will not give us the performance we think it should.
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS" and surprisingly there are no clear resources on "EFS" vs "HDFS" which confuses me in understanding if they are really a substitute for each other or are complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
There are possibilities by any kind of storage but as your situation is a one time scenario you need a choice with respect to
Cost optimized
well Performed
Secure
I can not answer to all your questions but concerning your use case I consider you use reach the data from EC2 instance and if you had mentioned the producing and processing of these files and the size of each file approximately maybe I could help you better.
Considerations:
EBS has a provisioned or limited Throughput and force you to provision and remove the data after treatment. FYI: you can set retention policy of EBS volume to be deleted by EC2 termination but not on shutdown.
If you need really the fastest way and don't care about costs EBS is a good idea with a good provisioning as you are charged by their life and storage.
EFS is a NAS storage and also needs the data be removed after treatment.
HDFS is a distributed file system and is the best choice for petabyte and distributed file systems but is not used as a one shot solution, you need installation and configuration.
I propose you personally the S3 as you does not have a limited throughput and using VPC endpoint you can achieve up to 25 Gbps, alternatively you can use the S3 life cycle policies to remove your data automatically based on tags or after 1 up to 356 days or archive them if needed.

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files everyday (average of 150 total uploads and 250 total downloads per day).
I'm still trying to read about the whole AWS ecosystem and I need help with the ff:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web api for putting and retrieving huge amounts of data, whereas EBS would be an NFS-mounted device. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you do that, in the future.) Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. Though, you should design your system so that it doesn't really matter, and you can easily teardown/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application to newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc. So, you should be able to failover to another one on demand.

Process data in AWS S3 from EC2 instance

I'm wondering what is the best way of processing huge amounts of images stored in AWS S3 buckets from an Ec2 instance located in the same availability zone.
Should I download the images that I need each time I have to process them and then delete when I'm done, and do the same thing every time I need to do some processing?
Or is there a better way, like mounting the S3 bucket into the EC2 instance? I have seen tools like Fuse for mounting, but I am not sure if this is the best way of processing the data.
First of all. Note that each EC2 instance can be killed, so keep data, and results at reasonable storage - like S3.
If you fetch whole image into memory, and then processing goes. I can't see needs for fetching to disk. On the other hand if image is quite big - you could fetch each part many times. So there is no easy answer, at least with out more information.
You can look at map reduce solutions. How they are dealing with keeping data close to processing unit. Spark is able to process things in memory.
About mounting resources. There are other options like Elastic File System, or Elastic Block Storage - that can be mounted.

amazon ec2 free server with persistent data

I will install a website in the free EC2 from amazon but I read something not good: I have a simple website which uses a database. Users come inside my website and post information, send commetns... if for some reason the instance breaks or amazon shuts it down, will I lose all information posted in my website and database? All files users uploaded and information saved will be gone?
If so, why would someone use EC2 if you lose all your data if some problem happens, and because problems always happen, sometime I will certainly lose my data!
I know I can save an image of my current OS in AWS but do I need to save the image everytime a user posts something to my website? It's ridiculous. I know I am missing something here, but I looked into google and people all the time say I should use EBS but it's not in the free plan. So how is it good idea using AWS EC2 free plan if my data will always be at risk of being lost?
Typically you would want to use an EBS backed instance. Since the free tier does not support that, but does offer EBS storage, create your database on an EBS partition for data you cannot lose
30 GB of Amazon Elastic Block Storage, plus 2 million I/Os and 1 GB of snapshot storage*
http://aws.amazon.com/free/
You should have a means to quickly launch a new instance, and you should back up the data on your EBS partition because EBS volumes can and do fail from time to time.
UPDATE
It seems that Micro instances are in fact EBS backed.
It is still advisable to attach a separate EBS volume, because it makes it much more convenient to backup the database (you create a snapshot of the EBS volume... you can find scripts online to accomplish that, which vary a bit depending on your choice of database and file system).

Storing data on EC2 vs S3 in AWS

I'm fairly new to AWS and am having trouble understanding what the best practice should be for hosting a product I'm developing...
I have a tool I plan on running on an EC2 instance but it only needs to run a couple of times a year, the rest of the time I can have the instance stopped and not incur charges. However the product utilizes quite a bit of data (approx 15 - 25 GB depending on the run). I understand that S3 is meant for storing data long term. But is there any reason why I can't just leave the data on an EC2 (even when it's stopped). Or do I have to do manual copies from S3 every time I want to execute a run
Even if the instance is off, you are still incurring charges for the EBS volumes. S3 also has the advantage of being somewhat more durable.
It would probably be a good idea to back up your data or snapshot your volume to s3 as a precaution, but I would not worry about transferring data back and forth.