What is the difference between AWS EFS and S3? - amazon-web-services

AWS released its new Elastic File System this week. See http://aws.amazon.com/efs/
The page doesn't contain many details. I'd like to know how its performance compares to S3, as well as any other differences.

You almost can't compare EFS and S3 because they are two very different things, even though there is some overlap in their functionality, or at least their apparent functionality.
They both store things and they both have a storage pricing model that scales linearly with usage over time.
But S3 is an object store with an HTTP interface and a mixed consistency model....
...while EFS is an actual filesystem with an NFS interface and as such will almost certainly offer immediate consistency.
S3, coupled with a utility like s3fs, can be used in a way that mimics a filesystem, but not to the point of behaving in all ways like an actual filesystem.
One way of looking at EFS is that it is an answer to the question, "how do I attach an EBS volume to multiple instances at the same time?" Previously, of course, the answer was, "you can't." You can mount the filesystem exposed by EFS on any number of instances and the result should be very similar to what you'd see if you had a "shared volume."
Its performance compared to S3 is not really a fair comparison, again, because they are different things for different purposes, but EFS will almost without question be "faster" by any meaningful definition of the word.
Also, no software should be required in order to mount an EFS filesystem on a Linux system.

As already mentioned, EFS is completely different from S3.
The simplest way to look at it is to look at what the underlying technology is.
S3 is an object store, meaning it is a higher-layer data storage system. Essentially it is database "blob" storage, storing data as objects in an underlying simple database.
It's designed for write-once, read-many access, perfect for media data like images or video, particularly as it is distributed and offers a very high level of redundancy.
EFS is a network storage system. Underlying it is a storage array (SAN), and it offers the standard protocol for multi-session network file systems (NFS).
It's built on high-speed SSD drives and is intended as shared storage for your EC2 instances; think file servers.
It's been a long time coming for AWS, and IMO this was one of the biggest missing components for AWS to really be a competitor to on-premise enterprise data centers.
Performance for EFS will be scalable, and although I have not seen the details yet, I am sure it will allow for provisioned IOPS just like EBS.

EFS is also considerably (10x) more expensive than S3, at $0.30 vs $0.03. From an IOPS perspective you should see better performance from EFS, as it's SSD-based and doesn't have the overhead of HTTP on top as S3 does. It's essentially NAS as a Service.

Two additional differences between the two:
AWS S3 offers Server-Side Encryption: http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
The same is not currently offered in AWS EFS.
Files stored in AWS S3 as public are accessible via a public URL to anyone.
In AWS EFS, however, in order to achieve the same, you'll need to deploy a web server that will serve your files.
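For the public-URL case, here is a minimal sketch of a bucket policy that lets anyone read every object in a bucket. The bucket name `example-bucket` and the helper `public_read_policy` are hypothetical, made up for illustration. The document only follows the usual IAM policy JSON layout, and the account's public-access settings may still block it.

```python
import json


def public_read_policy(bucket_name: str) -> dict:
    """Build a bucket policy that allows anonymous GET on every object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # Applies to all keys in the (hypothetical) bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    print(json.dumps(public_read_policy("example-bucket"), indent=2))
```

Nothing comparable exists for EFS, which is why a web server in front of the mounted filesystem is needed there.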

Choosing between EFS and S3 depends on your usage pattern.
EFS availability and durability are the same as S3's,
but the two have different usage patterns.
S3 has four common usage patterns:
static web content
host entire static websites
store data for large-scale analytics.
backup and archiving of critical data.
EFS is designed for applications that concurrently access data from multiple EC2 instances.
Put simply, you can attach one EFS file system to multiple EC2 instances; you can't do that with EBS.
Amazon claims that S3 performance exceeds the needs of any current user.
EFS performance has two modes
General Purpose
Max I/O
General Purpose is the default and is appropriate for most types of operations.
However, if your workload will exceed 7000 file operations per second, then Max I/O is your target.
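The rule of thumb above can be written as a small helper. This is only a sketch: the 7000 ops/s threshold is the figure quoted in this answer, the function name is made up, and the returned strings are assumed to match the performance-mode identifiers used by the EFS API.

```python
# Threshold quoted in the answer above; treat it as a rule of thumb.
GENERAL_PURPOSE_LIMIT_OPS = 7000


def choose_performance_mode(file_ops_per_second: float) -> str:
    """Pick an EFS performance mode from the expected file operations per second."""
    if file_ops_per_second > GENERAL_PURPOSE_LIMIT_OPS:
        return "maxIO"
    return "generalPurpose"


if __name__ == "__main__":
    for ops in (500, 7000, 12000):
        print(ops, choose_performance_mode(ops))
```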

Related

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read from millions of files and process them in parallel (later indexing them in ES). This is a one-time operation, and after processing we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after EC2 instance is shut down. It is accessible from a single EC2 instance from our AWS region. It will be useful if we split the data on our own and provide it to different EC2 instances. It offers redundancy and encryption security. Easy to scale. We can use it if we divide the chunks manually and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don’t have to worry about maintaining and deploying the FS
S3: Not limited to access from EC2 but S3 is not a file system
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
Problem with small files in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Considering most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number of files, so it will not give us the performance we expect.
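The manual chunking mentioned under the EBS option could be sketched as follows. This is an illustration only: the function names are hypothetical, and hashing the path with CRC32 is just one simple way to get a stable assignment of files to servers.

```python
import zlib


def assign_worker(path: str, worker_count: int) -> int:
    """Map a file path to a worker index in a stable, repeatable way."""
    return zlib.crc32(path.encode("utf-8")) % worker_count


def partition(paths, worker_count):
    """Split a list of paths into one bucket per worker."""
    buckets = [[] for _ in range(worker_count)]
    for path in paths:
        buckets[assign_worker(path, worker_count)].append(path)
    return buckets


if __name__ == "__main__":
    # Hypothetical file names standing in for the real input files.
    files = [f"input/file-{i:07d}.json" for i in range(20)]
    for index, bucket in enumerate(partition(files, 4)):
        print(index, len(bucket))
```

Because the assignment depends only on the path, each server can compute its own share without a coordinator.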
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS" and surprisingly there are no clear resources on "EFS" vs "HDFS" which confuses me in understanding if they are really a substitute for each other or are complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any of these storage options could work, but since yours is a one-time scenario, you need a choice that is:
Cost-optimized
Well-performing
Secure
I cannot answer all your questions, but for your use case I assume you access the data from EC2 instances. If you had mentioned how these files are produced and processed, and the approximate size of each file, I could perhaps help you better.
Considerations:
EBS has provisioned (limited) throughput and forces you to provision the volume and remove the data after processing. FYI: you can configure an EBS volume to be deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, EBS with good provisioning is a good idea, keeping in mind that you are charged for the volume's lifetime and provisioned storage.
EFS is NAS storage and also requires the data to be removed after processing.
HDFS is a distributed file system and is the best choice for petabyte-scale distributed storage, but it is not suited to a one-shot solution because it needs installation and configuration.
I personally recommend S3, as it does not have a provisioned throughput limit, and using a VPC endpoint you can achieve up to 25 Gbps. You can also use S3 lifecycle policies to remove your data automatically, based on tags or after a set number of days, or archive it if needed.
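A lifecycle rule of the kind mentioned above might look like this. It is a sketch under assumptions: the tag `job=one-time-import`, the 7-day expiry, the bucket name in the comment and the helper name are all hypothetical. The dictionary follows the shape expected by boto3's `put_bucket_lifecycle_configuration`, but the call itself is left as a comment so the example runs without AWS access.

```python
import json


def expire_tagged_objects(tag_key: str, tag_value: str, days: int) -> dict:
    """Build a lifecycle configuration that deletes tagged objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-{tag_key}-{tag_value}",
                # Only objects carrying this tag are affected.
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    config = expire_tagged_objects("job", "one-time-import", 7)
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket", LifecycleConfiguration=config)
```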

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files everyday (average of 150 total uploads and 250 total downloads per day).
I'm still trying to read about the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas EBS is a block device attached to an instance, on which you create a filesystem. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you do that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. Though, you should design your system so that it doesn't really matter, and you can easily teardown/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application to newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc. So, you should be able to failover to another one on demand.
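To put the load from the question in perspective, here is a rough back-of-the-envelope calculation. It assumes the ~10 GB per day is spread over the 400 daily transfers (150 uploads plus 250 downloads) and uses decimal units. Real traffic will be burstier than a daily average.

```python
SECONDS_PER_DAY = 86_400


def average_load(total_gb_per_day: float, transfers_per_day: int):
    """Return (average transfer size in MB, average bandwidth in Mbit/s)."""
    bytes_per_day = total_gb_per_day * 1e9
    avg_transfer_mb = bytes_per_day / transfers_per_day / 1e6
    avg_mbit_per_s = bytes_per_day * 8 / SECONDS_PER_DAY / 1e6
    return avg_transfer_mb, avg_mbit_per_s


if __name__ == "__main__":
    size_mb, mbit_s = average_load(10, 150 + 250)
    print(f"average transfer: {size_mb:.1f} MB, average bandwidth: {mbit_s:.2f} Mbit/s")
```

That works out to roughly 25 MB per transfer and under 1 Mbit/s on average, which is why a small instance type is a reasonable starting point.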

Comparative Application ebs vs s3

I am new to cloud environment. I do understand the definition and storage types EBS and S3. I wanted to understand the application of EBS as compared to S3.
I do understand that EBS looks like a device for heavy-throughput operations. I cannot find any application where it would be used in preference to S3. I could think of putting server logs on EBS magnetic storage, as one EBS volume can be attached to one instance.
With S3 you can use its scaling property to add heavy data and expand in real time. We can deploy our self-managed DBs on this service.
Please correct me if I am wrong. Please help me understand what is best suited for what and application of them in comparison with one another.
As you stated, they are primarily different types of storage:
Amazon Elastic Block Store (EBS) is a persistent disk-storage service, which provides storage volumes to a virtual machine (similar to VMDK files in VMWare)
Amazon Simple Storage Service (S3) is an object store system that stores files as objects and optionally makes them available across the Internet.
So, how do people choose which to use? It's quite simple... If they need a volume mounted on an Amazon EC2 instance, they need to use Amazon EBS. It gives them a C:, D: drive, etc in Windows and a mountable volume in Linux. Computers traditionally expect to have locally-attached disk storage. Put simply: If the operating system or an application running on an Amazon EC2 instance wants to store data locally, it will use EBS.
EBS Volumes are actually stored on two physical devices in case of failure, but an EBS volume appears as a single volume. The volume size must be selected when the volume is created. The volume exists in a single Availability Zone and can only be attached to EC2 instances in the same Availability Zone. EBS Volumes persist even when the attached EC2 instance is Stopped; when the instance is Started again, the disk remains attached and all data has been preserved.
Amazon S3, however, is something quite different. It is a storage service that allows files to be uploaded/downloaded (PutObject, GetObject) and files are replicated across at least three data centers. Files can optionally be accessed via the Internet via HTTP/HTTPS without requiring a web server. There are no limits on the amount of data that can be stored. Access can be granted per-object, per-bucket via a Bucket Policy, or via IAM Users and Groups.
Amazon S3 is a good option when data needs to be shared (both locally and across the Internet), retained for long periods, backed up (or even for storing backups) and made accessible to other systems. However, applications need to be specifically coded to use Amazon S3, and many traditional applications expect to store data on a local drive rather than on a separate storage service.
While Amazon S3 has many benefits, there are still situations where Amazon EBS is a better storage choice:
When using applications that expect to store data locally
For storing temporary files
When applications want to partially update files, because the smallest storage unit in S3 is a file and updating a part of a file requires re-uploading the whole file
For very High-IO situations, such as databases (EBS Provisioned IOPS can provide volumes up to 20,000 IOPS)
For creating volume snapshots as backups
For creating Amazon Machine Images (AMIs) that can be used to boot EC2 instances
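The partial-update point in the list above can be illustrated with a small simulation. The local temporary file stands in for a file on an EBS-backed filesystem, and a plain dictionary stands in for an object store. Both function names are hypothetical and no AWS call is made.

```python
import os
import tempfile


def patch_file_in_place(path: str, offset: int, data: bytes) -> int:
    """Overwrite bytes at `offset`; only the changed bytes are written."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return len(data)  # bytes written


def patch_object(store: dict, key: str, offset: int, data: bytes) -> int:
    """Object-store style update: fetch the whole object, modify it, put it back whole."""
    body = bytearray(store[key])            # full download
    body[offset:offset + len(data)] = data  # change a few bytes
    store[key] = bytes(body)                # full upload
    return len(body)  # bytes uploaded


if __name__ == "__main__":
    payload = b"A" * 1_000_000
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        written = patch_file_in_place(tmp.name, 10, b"xyz")
    finally:
        os.remove(tmp.name)
    uploaded = patch_object({"big.bin": payload}, "big.bin", 10, b"xyz")
    print(f"file: {written} bytes written, object: {uploaded} bytes uploaded")
```

Changing three bytes costs three bytes of writes on the file, but a full re-upload of the object.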
Bottom line: They are primarily different types of storage and each has its own usage sweet spot, just as a database is a good form of storage depending upon the situation.

How to Share a storage between multiple Amazon EC2 instances?

How to share S3 storage between multiple EC2 instances? I am beginner to AWS, I need to know how to share a drive between multiple EC2 instances.
Currently you can't, and S3 is your best bet, but AWS does have their Elastic File System in BETA currently, and there is the possibility it will be available for general availability anytime (I have no inside knowledge, just a guess - maybe even this week, they often have lots of announcements during their annual conference going on now).
You can signup for 'preview' access and see if it suits your needs, and then decide if you can wait for it to become fully available.
AWS EFS will allow you to share a drive between instances:
Amazon EFS supports the Network File System version 4 (NFSv4)
protocol, so the applications and tools that you use today work
seamlessly with Amazon EFS. Multiple Amazon EC2 instances can access
an Amazon EFS file system at the same time, providing a common data
source for workloads and applications running on more than one
instance.
https://aws.amazon.com/efs/
EFS (still in beta, half a year later) indeed looks like the best option. But as EFS is basically just a managed, highly available NFS server, it should be possible to roll out some other NFS solution first, and replace it with EFS once it's finally available.
One promising candidate seems to be dCache, which is
a system for storing and retrieving huge amounts of data, distributed
among a large number of heterogenous server nodes, under a single
virtual filesystem tree with a variety of standard access methods.
It is used by research institutions all over the world to store over 100PB of data, and it provides an NFSv4 interface. Not sure how easy setup on AWS would be, or what the performance would be like.
https://www.dcache.org/

AWS EFS vs EBS vs S3 (differences & when to use?) [closed]

As per the title of this question, what are the practical differences between AWS EFS, EBS and S3?
My understanding of each:
S3 is a storage facility accessible anywhere
EBS is a device you can mount onto EC2
EFS is a file system you can mount onto EC2
So why would I use EBS over EFS? It seems like they have the same use cases with minor semantic differences, although EFS is replicated across AZs whereas EBS is just a mounted device. I guess my understanding of EBS is lacking, hence I'm unable to distinguish them.
Why choose S3 over EFS? They both store files, scale and are replicated. I guess with S3 you have to use the SDK, whereas with EFS being a file system you can use standard I/O methods from your programming language of choice to create files. But is that the only real difference?
One word answer: MONEY :D
1 GB to store in US-East-1:
(Updated at 2016.dec.20)
Glacier: $0.004/Month (Note: Major price cut in 2016)
S3: $0.023/Month
S3-IA (announced in 2015.09):
$0.0125/Month (+$0.01/gig retrieval charge)
EBS: $0.045-0.1/Month (depends on speed - SSD or not) + IOPS costs
EFS: $0.3/Month
Further storage options, which may be used for temporary storing data while/before processing it:
SNS
SQS
Kinesis stream
DynamoDB, SimpleDB
The costs above are just samples. There can be differences by region, and it can change at any point. Also there are extra costs for data transfer (out to the internet). However they show a ratio between the prices of the services.
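To make the ratios concrete, the sample prices above can be dropped into a short script. The numbers are copied from this answer (US-East-1, December 2016), so treat them as a snapshot. EBS is entered twice for the low and high end of its range, and IOPS, retrieval and transfer charges are ignored.

```python
# USD per GB-month, as listed in this answer.
PRICE_PER_GB_MONTH = {
    "Glacier": 0.004,
    "S3-IA": 0.0125,
    "S3": 0.023,
    "EBS (low end)": 0.045,
    "EBS (high end)": 0.10,
    "EFS": 0.30,
}


def monthly_storage_cost(service: str, gigabytes: float) -> float:
    """Storage-only cost for one month."""
    return PRICE_PER_GB_MONTH[service] * gigabytes


def price_ratio(service_a: str, service_b: str) -> float:
    """How many times more expensive service_a is than service_b per GB."""
    return PRICE_PER_GB_MONTH[service_a] / PRICE_PER_GB_MONTH[service_b]


if __name__ == "__main__":
    for name in sorted(PRICE_PER_GB_MONTH, key=PRICE_PER_GB_MONTH.get):
        print(f"{name:15s} ${monthly_storage_cost(name, 1000):8.2f} per TB-month (1000 GB)")
    print(f"EFS vs S3: {price_ratio('EFS', 'S3'):.1f}x")
```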
There are a lot more differences between these services:
EFS is:
Generally Available (out of preview), but may not yet be available in your region
Network filesystem (that means it may have bigger latency but it can be shared across several instances; even between regions)
It is expensive compared to EBS (~10x more) but it gives extra features.
It's a highly available service.
It's a managed service
You can attach the EFS storage to an EC2 Instance
Can be accessed by multiple EC2 instances simultaneously
Since 2016.dec.20 it's possible to attach your EFS storage directly to on-premise servers via Direct Connect.
EBS is:
A block storage (so you need to format it). This means you are able to choose which type of file system you want.
As it's block storage, you can use RAID 1 (or 0 or 10) across multiple volumes
It is really fast
It is relatively cheap
With the new announcements from Amazon, you can store up to 16 TB of data per volume on SSDs.
You can snapshot an EBS (while it's still running) for backup reasons
But it only exists in a particular region. Although you can migrate it to another region, you cannot just access it across regions (only if you share it via the EC2; but that means you have a file server)
You need an EC2 instance to attach it to
New feature (2017.Feb.15): You can now increase volume size, adjust performance, or change the volume type while the volume is in use. You can continue to use your application while the change takes effect.
S3 is:
An object store (not a file system).
You can store files and "folders" but can't have locks, permissions etc like you would with a traditional file system
This means, by default you can't just mount S3 and use it as your webserver
But it's perfect for storing your images and videos for your website
Great for short term archiving (e.g. a few weeks). It's good for long term archiving too, but Glacier is more cost efficient.
Great for storing logs
You can access the data from every region (extra costs may apply)
Highly Available, Redundant. Basically data loss is not possible (99.999999999% durability, 99.9% uptime SLA)
Much cheaper than EBS.
You can serve the content directly to the internet, you can even have a full (static) website working direct from S3, without an EC2 instance
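The remark about "folders" in the S3 list above can be made concrete with a small simulation. An object store has a flat key namespace, and folder-like listings are derived from key prefixes and a delimiter. The function below is a plain-Python imitation of that idea, not an AWS API call, and the key names are invented.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`."""
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Deeper keys are rolled up into a folder-like common prefix.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


if __name__ == "__main__":
    keys = ["logs/2016/a.log", "logs/2016/b.log", "logs/readme.txt", "img/cat.png"]
    print(list_prefix(keys))           # top level
    print(list_prefix(keys, "logs/"))  # inside the "logs/" pseudo-folder
```

There is no directory object to lock or set permissions on. The "folder" exists only because some keys share a prefix.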
Glacier is:
Long term archive storage
Extremely cheap to store
Potentially very expensive to retrieve
Takes up to 4 hours to "read back" your data (so only store items you know you won't need to retrieve for a long time)
As mentioned in JDL's comment, there are several interesting aspects in terms of pricing. For example, Glacier, S3 and EFS allocate storage for you based on your usage, while with EBS you need to predefine the allocated storage, which means you need to overestimate. (Although it's easy to add more storage to your EBS volumes, it requires some engineering, so you always "overpay" for your EBS storage, which makes it even more expensive.)
Source: AWS Storage Update – New Lower Cost S3 Storage Option & Glacier Price Reduction
I wonder why people are not highlighting the MOST compelling reason in favor of EFS. EFS can be mounted on more than one EC2 instance at the same time, enabling access to files on EFS at the same time.
(Edit, May 2020: EBS now also supports attaching a volume to multiple EC2 instances at the same time; see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes-multi.html)
Fixing the comparison:
S3 is a storage facility accessible anywhere
EBS is a device you can mount onto EC2
EFS is a file system you can mount onto several EC2 instances at the same time
At this point it's a little premature to compare EFS and EBS- the performance of EFS isn't known, nor is its reliability known.
Why would you use S3?
You don't have a need for the files to be 'local' to one or more EC2 instances.
(effectively) infinite capacity
built-in web serving, authentication
Apart from price and features, the throughput also varies greatly (as mentioned by user1677120):
EBS
Taken from EBS docs:
| EBS volume type | Throughput (MiB/s) | Throughput dependent on        |
|-----------------|--------------------|--------------------------------|
| gp2 (SSD)       | 128-160            | volume size                    |
| io1 (SSD)       | 0.25-500           | IOPS (256 KiB/s per IOPS)      |
| st1 (HDD)       | 20-500             | volume size (40 MiB/s per TiB) |
| sc1 (HDD)       | 6-250              | volume size (12 MiB/s per TiB) |
Note that for io1, st1 and sc1 you can burst throughput to at least 125 MiB/s, and up to 500 MiB/s, depending on volume size.
You can further increase throughput by e.g. deploying EBS volumes as RAID0
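The per-IOPS and per-TiB rates in the EBS table can be turned into a quick estimator. This is a sketch that uses only the figures quoted above. gp2 is left out because the table gives a range for it but no formula, and the function name is made up.

```python
def ebs_baseline_throughput_mib_s(volume_type: str, size_tib: float = 0.0, iops: int = 0) -> float:
    """Estimate baseline throughput in MiB/s from the rates in the table above."""
    if volume_type == "io1":
        # 256 KiB/s per provisioned IOPS, capped at 500 MiB/s.
        return min(iops * 256 / 1024, 500.0)
    if volume_type == "st1":
        # 40 MiB/s per TiB, capped at 500 MiB/s.
        return min(size_tib * 40, 500.0)
    if volume_type == "sc1":
        # 12 MiB/s per TiB, capped at 250 MiB/s.
        return min(size_tib * 12, 250.0)
    raise ValueError(f"no formula given in the table for {volume_type!r}")


if __name__ == "__main__":
    print(ebs_baseline_throughput_mib_s("io1", iops=1000))
    print(ebs_baseline_throughput_mib_s("st1", size_tib=2))
    print(ebs_baseline_throughput_mib_s("sc1", size_tib=2))
```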
EFS
Taken from EFS docs
| Filesystem size (GiB) | Base throughput (MiB/s) | Burst throughput (MiB/s) |
|-----------------------|-------------------------|--------------------------|
| 10                    | 0.5                     | 100                      |
| 256                   | 12.5                    | 100                      |
| 512                   | 25.0                    | 100                      |
| 1024                  | 50.0                    | 100                      |
| 1536                  | 75.0                    | 150                      |
| 2048                  | 100.0                   | 200                      |
| 3072                  | 150.0                   | 300                      |
| 4096                  | 200.0                   | 400                      |
The base throughput is guaranteed; burst throughput uses up credits you gathered while being below the base throughput (so you'll only have this for a limited time; see the EFS docs for more details).
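The rows of the EFS table appear to follow a simple pattern: about 50 MiB/s of base throughput per TiB stored, and a burst rate of 100 MiB/s per TiB with a floor of 100 MiB/s. The sketch below encodes that reading. The constants are inferred from the table, not taken from an official formula.

```python
# Inferred from the table above; assumptions, not official figures.
BASE_MIB_S_PER_TIB = 50.0
BURST_MIB_S_PER_TIB = 100.0
MIN_BURST_MIB_S = 100.0


def efs_throughput(size_gib: float):
    """Return (base, burst) throughput in MiB/s for a filesystem of `size_gib`."""
    size_tib = size_gib / 1024
    base = size_tib * BASE_MIB_S_PER_TIB
    burst = max(MIN_BURST_MIB_S, size_tib * BURST_MIB_S_PER_TIB)
    return base, burst


if __name__ == "__main__":
    for size in (10, 256, 1536, 4096):
        base, burst = efs_throughput(size)
        print(f"{size:5d} GiB: base {base:6.1f} MiB/s, burst {burst:6.1f} MiB/s")
```

For 10 GiB this gives about 0.49 MiB/s, which the table rounds to 0.5.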
S3
S3 is a totally different thing, so it cannot really be compared to EBS and EFS. Plus, there are no published throughput metrics for S3. You can improve throughput by downloading in parallel (I read somewhere that AWS states you would have basically unlimited throughput this way), or by adding CloudFront to the mix.
To add to the comparison: (burst) read/write performance on EFS depends on gathered credits. Gathering of credits depends on the amount of data you store on it. More data -> more credits. That means that when you only need a few GB of storage which is read or written often, you will run out of credits very soon and throughput drops to about 50 kB/s.
The only way to fix this (in my case) was to add large dummy files to increase the rate credits are earned. However more storage -> more cost.
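A rough way to size that dummy data: assume, as the EFS table quoted earlier in this thread suggests, about 50 MiB/s of base throughput per TiB stored, and the $0.30 per GB-month price mentioned above. Both constants are assumptions taken from this thread, GiB and GB are treated as equal for the cost estimate, and the function names are hypothetical.

```python
BASE_MIB_S_PER_TIB = 50.0  # inferred from the EFS table earlier in this thread
PRICE_PER_GB_MONTH = 0.30  # EFS price quoted earlier in this thread


def storage_needed_gib(target_mib_s: float) -> float:
    """Stored data needed for the base throughput to reach `target_mib_s`."""
    return target_mib_s / BASE_MIB_S_PER_TIB * 1024


def padding_needed_gib(current_gib: float, target_mib_s: float) -> float:
    """Extra dummy data required on top of what is already stored."""
    return max(0.0, storage_needed_gib(target_mib_s) - current_gib)


if __name__ == "__main__":
    padding = padding_needed_gib(current_gib=5, target_mib_s=5)
    print(f"padding: {padding:.1f} GiB, roughly ${padding * PRICE_PER_GB_MONTH:.2f} per month")
```

Sustaining 5 MiB/s with only 5 GiB of real data would need roughly 97 GiB of padding under these assumptions.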
AWS (Amazon Web Services) is well-known for its extensive product line. There are (probably) a few Amazon Web Services ninjas who know exactly how and when to use which Amazon product for which task. The rest of us are in desperate need of assistance.
AWS offers three common storage services: S3, Elastic Block Store (EBS), and Elastic File System (EFS), all of which function differently and provide various levels of performance, cost, availability, and scalability. We'll compare the performance, cost, and accessibility to stored data of these storage options, as well as their use cases.
AWS Storage Options:
Amazon S3 is a basic object storage service that can be used to host website images and videos, as well as data analytics and smartphone and web applications. Data is managed as objects in object storage, which means that all data types are stored in their native formats. With object storage, there is no hierarchy of file relationships, and data objects can be spread through many machines. You can use the S3 service from any computer with an internet connection.
AWS EBS offers block-level data storage that is persistent. Block storage systems are more versatile and provide better capacity than standard file storage since files are stored in several volumes called blocks, which serve as separate hard drives. An Amazon EC2 instance must be mounted with EBS. Business continuity, software testing, and database management are examples of use cases.
AWS EFS is a shared, elastic file storage framework that expands and contracts in response to file additions and deletions. It follows the conventional file storage model, with data organized into folders and subdirectories. EFS is useful for content management systems and SaaS applications. EFS can be mounted on several EC2 instances at once.
Which AWS Cloud Storage Service Is Best?
As always, it depends.
For data storage alone, Amazon S3 is the cheapest choice. S3, on the other hand, has a range of other pricing criteria, including cost per upload, S3 Analytics, and data transfer out of S3 per gigabyte. The cost structure of EFS is the most straightforward.
Amazon S3 is a cloud storage service that can be accessed from anywhere. AWS EBS is only accessible in a single region, while multiple EFS instances can share files across multiple regions.
EBS and EFS both outperform Amazon S3 in terms of IOPS and latency.
With a single API call, EBS can be scaled up or down. You can use EBS for database backups and other low-latency interactive applications that need reliable, predictable performance because it is less expensive than EFS.
Large amounts of data, such as large analytic workloads, are better served by EFS. With EBS, users must break up the data and distribute it across volumes on different instances, because data at this scale cannot be stored on the EBS volumes of a single EC2 instance. An EFS file system can be accessed by thousands of EC2 instances at the same time, allowing vast volumes of data to be processed and analyzed in real time.
EBS is simple: block-level storage that can be attached to an instance in the same AZ and that survives independently of the instance's lifetime.
The more interesting difference is between EFS and S3, and identifying the proper use cases for each.
Cost: EFS is approximately 10 times as costly as S3.
Use cases:
Whenever thousands of instances need to process files simultaneously, EFS is recommended over S3.
Also note that S3 is object-based storage while EFS is file-based. This implies that whenever files must be updated continuously (refreshed), we should use EFS.
S3 is eventually consistent while EFS is strongly consistent. If you can't afford eventual consistency, you should use EFS.
In simple words
Amazon EBS provides block level storage .
Amazon EFS provides network-attached shared file storage.
Amazon S3 provides object storage .
AWS EFS, EBS and S3. From Functional Standpoint, here is the difference
EFS:
Network filesystem: can be shared across several servers, even between regions. The same is not available with EBS.
This can be used especially for storing ETL programs without security risk.
Highly available, scalable service.
Suited to running any application that has a high workload, requires scalable storage, and must produce output quickly.
It can provide higher throughput. It matches sudden file system growth, even for workloads up to 500,000 IOPS or 10 GB per second.
Lift-and-shift application support: EFS is elastic, available, and scalable, and enables you to move enterprise applications easily and quickly without needing to re-architect them.
Analytics for big data: It has the ability to run big data applications, which demand significant node throughput, low-latency file access, and read-after-write operations.
EBS:
NoSQL databases: EBS offers the low-latency performance and dependability they need for peak performance.
S3:
Robust performance, scalability, and availability: Amazon S3 scales storage resources free from resource procurement cycles or investments upfront.
Data lake and big data analytics: Create a data lake to hold raw data in its native format, then use machine learning tools and analytics to draw insights.
Backup and restoration: Secure, robust backup and restoration solutions
Data archiving
S3 is an object store good at storing vast numbers of backups or user files. Unlike EBS or EFS, S3 is not limited to EC2. Files stored within an S3 bucket can be accessed programmatically or directly from services such as AWS CloudFront. Many websites use it to hold their content and media files, which may be served efficiently via AWS CloudFront.
The main difference between EBS and EFS is that EBS is only accessible from a single EC2 instance in your particular AWS region, while EFS allows you to mount the file system across multiple regions and instances.
Finally, Amazon S3 is an object store good at storing vast numbers of backups or user files.
Amazon EBS provides block-level storage. You create a filesystem on it and store files.
Amazon EFS is a shared storage system similar to NAS/SAN. You need to mount it on a Unix server to use it.
Amazon S3 is object-based storage where each item is stored with an HTTP URL.
One of the differences is that EBS can be attached to one instance at a time, while EFS can be attached to multiple instances, which is why it is shared storage.
S3 is plain object storage and cannot be mounted.
EFS & S3 have the same purpose, you can store any kind of object or files.
But for me the only difference is that EFS gives you a traditional file system for your VMs (EC2) in the cloud, with more flexibility, such as attaching it to multiple instances.
S3, on the other hand, is a separate flexible and elastic server for your objects. It can be used for your static files, images, videos or even hosting static app (js).
EBS is obviously for block storage where you can install OS or anything related to your OS.
This question has been thoroughly answered by other people. I just want to make the point that when deciding on any AWS service, you should understand the use case for each and consider what the service provides in terms of the Well-Architected Framework: do you need High Availability, Fault Tolerance, Cost Optimization? This will help you decide which service to use.