Syncing remote folders from several machines to one AWS instance - amazon-web-services

I have 3 AWS P instances processing some heavy stuff and saving results to /home/user/folder.
I also have a main server with the same folder, where I want to collect the results from those 3 instances.
Each instance works on its own part of the whole task, and their results go into non-overlapping subfolders.
The instances have 2 TB each, so I would like to fetch results from each instance as soon as they appear.
That way, when a job is done, I won't spend half a day copying results to the main server.
I think one way of solving this is running something like this on each instance:
*/30 * * * * rsync -a /home/user/folder/ ubuntu@1.1.1.1:/home/user/folder/
Are there any smarter ways of achieving the same result, given that all of the instances are on AWS?
I also thought about (1) detachable storage and (2) storing on S3 but being new to AWS I might overlook some hidden pitfalls in such workflows, especially when it comes to terabytes of data and expensive instances.
How do you collect processed data from remote instances?

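I would consider using the rclone tool, which can easily be configured for a shared S3 bucket. Just be aware of the difference between copy and sync mode: copy only adds or updates files at the destination, while sync also deletes destination files that are missing from the source. It can reach several gigabits per second of throughput, depending on your instance type.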
Link for the project: rclone.org
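A minimal sketch of how each worker might push its results to a shared bucket with rclone, driven from Python. The remote name s3results, the bucket my-results-bucket and the worker name are assumptions for illustration, not part of the original answer. It uses copy rather than sync, so nothing is deleted at the destination, and each worker writes under its own prefix.

```python
import subprocess


def rclone_copy_command(local_dir, remote, bucket, worker, transfers=8):
    """Build an `rclone copy` argv that uploads local_dir under a per-worker prefix."""
    dest = f"{remote}:{bucket}/{worker}"
    return ["rclone", "copy", local_dir, dest, "--transfers", str(transfers)]


def push_results(local_dir, remote, bucket, worker):
    """Run the upload; raises CalledProcessError if rclone exits non-zero."""
    cmd = rclone_copy_command(local_dir, remote, bucket, worker)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names; the remote must already exist in the rclone config.
    cmd = rclone_copy_command(
        "/home/user/folder", "s3results", "my-results-bucket", "worker-1"
    )
    print(" ".join(cmd))
```

Calling push_results from cron or a loop on each worker would upload new files incrementally, since rclone skips files that are already present and unchanged at the destination.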

My thoughts on some of the options mentioned in OP and comments, as well as some other ones I thought of:
EFS: create an EFS and mount it as an NFS drive on all the instances. It's the easiest but probably costs the most.
s3fs: have all the instances mount the same S3 bucket using s3fs. This is likely the least expensive solution. You also don't need to worry about running out of disk space. The downside is that performance will not be as good as with mounted NFS drives.
EBS volumes: attach an EBS volume to each worker instance for them to write the results to. When they are done, detach the volumes and attach them to the main server. This will be the fastest and still cheaper than EFS. If you can't or won't do all the detaching/attaching manually you'll need to write some scripts.
Old school NFS shares: there is nothing wrong with a plain vanilla NFS setup without any of those fancy AWS acronyms. :-)
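For the EBS option, the detach/attach step could be scripted roughly as below. This is only a sketch: it assumes the ec2 argument is a boto3 EC2 client (for example boto3.client("ec2")), that the volume has already been unmounted on the worker, and that the worker and the main server are in the same availability zone. The device name /dev/sdf and all IDs are placeholders.

```python
def move_volume(ec2, volume_id, target_instance_id, device="/dev/sdf"):
    """Detach an EBS volume from its current instance and attach it to another.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    ec2.detach_volume(VolumeId=volume_id)
    # Block until the volume reports the "available" state (fully detached).
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=target_instance_id, Device=device
    )
    # Block until the volume reports the "in-use" state.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```

After the attach completes, the volume still has to be mounted on the main server before its files can be read.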

Related

Loading large amount of data from local machine into Amazon Elastic Block Store

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with an attached EBS volume, and I was able to load files onto it via scp from my local command line. I will have several gigabytes of data to load onto this EBS volume (I know that isn't a lot by ML standards, but that's not really my point). I would like to know the appropriate way to load this data. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu@<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch job run overnight like this, but I am not sure whether it may incur data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything on costs associated with loading data, but I wanted to confirm with people who have done something similar that this is the correct approach and, if not, what a better one would be.
When managing large objects in AWS, always consider S3 first. It provides effectively unlimited storage capacity and is better suited to object storage than EBS, which is a block store. EBS bills you for the size of the volume you provision, so you may end up over-provisioned (paying for capacity you don't use) or under-provisioned (which can lead to poor performance or even downtime).
With S3 you are billed per GB per month for the storage you actually consume. It is a pay-for-what-you-use model, and it is very cheap compared to EBS.
Lastly, first evaluate whether one of the AWS Machine Learning services fits your use case; it could save you a lot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details
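To make the billing difference concrete, here is a small arithmetic sketch. The per-GB prices are placeholder assumptions for illustration only, not quoted rates; check the pricing pages for the figures in your region.

```python
def monthly_storage_cost(used_gb, provisioned_gb, ebs_price_per_gb, s3_price_per_gb):
    """Return (ebs_cost, s3_cost) for one month of storage.

    EBS is billed on the provisioned volume size, S3 on the data actually stored.
    """
    ebs_cost = provisioned_gb * ebs_price_per_gb
    s3_cost = used_gb * s3_price_per_gb
    return ebs_cost, s3_cost


# Hypothetical scenario: 20 GB of data on a 100 GB volume, with assumed prices.
ebs, s3 = monthly_storage_cost(
    used_gb=20, provisioned_gb=100, ebs_price_per_gb=0.10, s3_price_per_gb=0.023
)
print(f"EBS: ${ebs:.2f}/month, S3: ${s3:.2f}/month")
```

The gap comes mostly from the unused 80 GB: the EBS figure scales with what was provisioned, the S3 figure with what is stored.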

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation; after processing we won't read those files again. We want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after EC2 instance is shut down. It is accessible from a single EC2 instance from our AWS region. It will be useful if we split the data on our own and provide it to different EC2 instances. It offers redundancy and encryption security. Easy to scale. We can use it if we divide the chunks manually and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don’t have to worry about maintaining and deploying the FS
S3: Not limited to access from EC2 but S3 is not a file system
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
There is also the small-files problem in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Most of the files we receive are less than 1 MB, which can cause memory issues beyond a certain number of files, so it will not give us the performance we think it should.
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS" and surprisingly there are no clear resources on "EFS" vs "HDFS" which confuses me in understanding if they are really a substitute for each other or are complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any kind of storage could work, but since this is a one-time scenario you need a choice that is:
Cost-optimized
Performant
Secure
I cannot answer all of your questions, but for your use case I assume you will access the data from EC2 instances. If you had described how these files are produced and processed, and the approximate size of each file, I could probably help you better.
Considerations:
EBS has provisioned (and therefore limited) throughput, and it forces you to provision the volume and remove the data after processing. FYI: you can set an EBS volume to be deleted when the EC2 instance is terminated, but not when it is shut down.
If you really need the fastest option and don't care about cost, a well-provisioned EBS volume is a good idea; you are charged for the volume's lifetime and its provisioned storage.
EFS is NAS-style storage and also requires the data to be removed after processing.
HDFS is a distributed file system and the best choice for petabyte-scale distributed storage, but it is not suited to a one-shot job because it needs installation and configuration.
Personally I would suggest S3: it has no provisioned throughput limit, and with a VPC endpoint you can achieve up to 25 Gbps. You can also use S3 lifecycle policies to remove your data automatically, based on tags or after a set number of days, or to archive it if needed.
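As a rough illustration of the lifecycle idea, the sketch below builds a rule that expires objects carrying a given tag after a number of days. The rule ID, the tag status=processed, the 7-day period and the bucket name are all hypothetical. The dictionary follows the shape boto3 expects for put_bucket_lifecycle_configuration.

```python
def expiration_rule(rule_id, tag_key, tag_value, days):
    """Build one S3 lifecycle rule that expires tagged objects after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id,
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical rule: delete objects tagged status=processed after 7 days.
lifecycle = {"Rules": [expiration_rule("expire-processed", "status", "processed", 7)]}

# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-input-bucket", LifecycleConfiguration=lifecycle)
```

Tagging each object once it has been indexed would then let S3 clean up the processed files on its own schedule.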

Choosing a hosting platform that allows file and directory creation

I am trying to launch a project where my server generates user files and directories. Since Heroku doesn't allow that, I am trying to find the platform that best fits my needs without changing a bunch of my code.
My Node server stores data in Firebase, along with some files on the server itself. I realize this is not best practice, but it is what it is for now.
What would you recommend?
You can store your objects in S3. Do not store files on the VMs themselves, since they can be lost if the instance fails.
Depending on your needs, an EBS volume would be a good start. It is meant to be redundant, and the chance of losing any data is very small. The advantage is that it lives on if you terminate or stop an instance.
The newer EFS is very fast and can be mounted to multiple machines, much like an NFS file system. It is redundant across availability zones and will also survive a machine stop/termination.
S3 is an object store and isn't really meant for file system I/O. It can easily store files but it doesn't have nearly the performance of either EBS or EFS. It lives on after machine termination - indeed, it can be accessed with HTTP when properly configured.
Ultimately, you can create files normally on the EC2 with instance store, EBS, or EFS. The instance store data is lost if you terminate or even stop the instance. Be careful with that - you can easily lose tons of data when it is on instance store and not properly backed up.

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files every day (an average of 150 total uploads and 250 total downloads per day).
I'm still trying to read about the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticeable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas EBS is a block device that you attach to an instance and mount as a filesystem. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you do that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. That said, you should design your system so that the instance type doesn't really matter and you can easily tear down and replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application on newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc., so you should be able to fail over to another one on demand.

Process data in AWS S3 from EC2 instance

I'm wondering what the best way is to process huge amounts of images stored in AWS S3 buckets from an EC2 instance located in the same availability zone.
Should I download the images that I need each time I have to process them and then delete when I'm done, and do the same thing every time I need to do some processing?
Or is there a better way, like mounting the S3 bucket on the EC2 instance? I have seen FUSE-based tools for mounting, but I am not sure if this is the best way of processing the data.
First of all, note that any EC2 instance can be terminated, so keep your data and results in durable storage such as S3.
If you fetch a whole image into memory and process it there, I can't see any need to write it to disk. On the other hand, if an image is too big for that, you might end up fetching parts of it many times. So there is no easy answer, at least not without more information.
You can look at MapReduce-style solutions and how they keep data close to the processing unit. Spark is able to process data in memory.
As for mounting: there are other options that can be mounted, such as Elastic File System or Elastic Block Store.
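The in-memory approach mentioned above might look roughly like this. It is a sketch under assumptions: s3 is expected to be a boto3 S3 client (for example boto3.client("s3")), every image fits comfortably in RAM, and sha256_hex is only a placeholder for the real processing step.

```python
import hashlib


def process_in_memory(s3, bucket, keys, process):
    """Fetch each object fully into memory, process it, and collect results by key.

    `s3` is expected to behave like a boto3 S3 client.
    """
    results = {}
    for key in keys:
        # Read the whole object body as bytes; nothing is written to local disk.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        results[key] = process(body)
    return results


def sha256_hex(data):
    """Placeholder processing step: hash the image bytes."""
    return hashlib.sha256(data).hexdigest()
```

The returned results are small and could be written back to S3 afterwards, so nothing important lives only on the instance.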