EBS vs EFS read and write latencies - amazon-web-services

I am storing users' code on a file system, at present EBS in AWS. I am looking to improve availability and want to reduce the chance of an outage due to EBS going down. EFS appears to be a reasonable option.
I understand EFS will be slower than EBS and that EFS is more expensive. Are there any performance benchmarks measuring the read and write latencies of EFS, compared with EBS?

This AWS forums thread shows some of the problems customers have had with EFS latency and AWS's responses. Some customers report 1+ second latency, to which AWS support says that's not normal and they'll investigate.
My current experience in EU-West suggests that for a series of 150,000 small read operations of about 2.5 KB each, my EC2<->EFS link maxes out at 200 read ops per second, so we might guess at roughly 1/200th of a second, or 5 ms, of typical effective latency.
I say "effective latency" because that's really reporting a bandwidth, not a latency. I haven't written timing code to measure round-trip latency.
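If you do want true round-trip latency rather than an estimate from throughput, timing many individual small reads gives you a distribution. Here is a minimal sketch; the file names and sizes are illustrative, and for a meaningful EFS test you would point it at the EFS mount and account for the OS page cache (a local temp directory is used below only so it runs anywhere):

```python
import os
import statistics
import tempfile
import time

def measure_read_latency(directory, n_files=200, size=2500):
    """Create n_files small files, then time a full read of each one."""
    paths = []
    for i in range(n_files):
        p = os.path.join(directory, f"probe_{i}.bin")
        with open(p, "wb") as f:
            f.write(os.urandom(size))
        paths.append(p)
    latencies = []
    for p in paths:
        t0 = time.perf_counter()
        with open(p, "rb") as f:
            f.read()
        latencies.append(time.perf_counter() - t0)
    return {
        "median_ms": statistics.median(latencies) * 1000,
        "p95_ms": sorted(latencies)[int(0.95 * len(latencies))] * 1000,
    }

# Point `directory` at the EFS mount to test it; a temp dir is a stand-in here.
with tempfile.TemporaryDirectory() as d:
    print(measure_read_latency(d))
```

Note that reads served from the local page cache will look unrealistically fast, so on a real mount you would want cold files or a cache drop between runs.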
You can improve it by paying for a larger file system (baseline throughput scales with the amount of data stored) or for provisioned throughput.

EFS is a network file system (accessed over NFS). It provides a file system interface, file system access semantics (such as strong consistency and file locking), and concurrently-accessible storage for up to thousands of Amazon EC2 instances. Of course there will be read/write latency compared to EBS, as EBS is designed for low-latency access to data.
EBS provides different volume types, which differ in performance characteristics and price, so that you can tailor your storage performance and cost to the needs of your applications.
EFS is easy to use and offers a simple interface that allows you to create and configure file systems quickly and easily. With Amazon EFS, storage capacity is elastic, growing and shrinking automatically as you add and remove files, so your applications have the storage they need, when they need it.
Performance overview of EFS: http://docs.aws.amazon.com/efs/latest/ug/performance.html
Performance overview of EBS: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

Related

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (later indexing them in ES). This is a one-time operation; after processing, we won't read those million files again. Now we want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after the EC2 instance is shut down. It is accessible from a single EC2 instance in our AWS region. It offers redundancy and encryption, and it is easy to scale. We can use it if we split the data ourselves and hand the chunks to the different servers we have.
EFS: It allows us to mount the file system across multiple Availability Zones and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don't have to worry about deploying and maintaining the file system.
S3: Not limited to access from EC2, but S3 is not a file system.
HDFS: Extremely good at scale, but only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big a concern this is, considering our servers are pretty secure.
The small-file problem in Hadoop is explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/. Considering most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number, so it will not give us the performance we think it should.
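The memory pressure behind the small-file problem is easy to estimate: the HDFS NameNode keeps an in-memory object for every file and every block, commonly ballparked at about 150 bytes each. A back-of-envelope sketch (the 150-byte figure is a rule of thumb, not a guarantee):

```python
def namenode_memory_gb(n_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap needed: one object per file plus one per block."""
    objects = n_files * (1 + blocks_per_file)
    return objects * bytes_per_object / 1024**3

# 100 million sub-1MB files, each fitting in a single block:
print(f"{namenode_memory_gb(100_000_000):.1f} GB")  # ~28 GB of heap just for metadata
```

The point is that heap scales with file count, not data size, which is why millions of sub-1MB files hurt regardless of how small the total dataset is.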
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS" and surprisingly there are no clear resources on "EFS" vs "HDFS" which confuses me in understanding if they are really a substitute for each other or are complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any of these storage options could work, but since yours is a one-time scenario you need a choice that is:
Cost-optimized
Performant
Secure
I cannot answer all your questions, but for your use case I assume you access the data from EC2 instances. If you had mentioned how these files are produced and processed, and the approximate size of each file, I could help you better.
Considerations:
EBS has provisioned (and therefore limited) throughput, and forces you to provision capacity up front and remove the data after processing. FYI: you can set an EBS volume to be deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, EBS with generous provisioning is a good idea; you are charged for the volume's lifetime and storage.
EFS is NAS-style storage, and the data also needs to be removed after processing.
HDFS is a distributed file system and the best choice at petabyte scale, but it is not a one-shot solution; it needs installation and configuration.
Personally, I propose S3: it does not cap your throughput, and using a VPC endpoint you can reach up to 25 Gbps. You can also use S3 lifecycle policies to remove your data automatically, based on tags or after a configurable number of days, or archive it if needed.
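As a sketch of what such a lifecycle policy looks like, this builds the rule document you would pass to `put_bucket_lifecycle_configuration` (boto3) or `aws s3api put-bucket-lifecycle-configuration`; the prefix and day count are illustrative:

```python
import json

def expire_after_days(days, prefix=""):
    """Lifecycle configuration that deletes objects `days` days after creation."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }

# Delete processed input files under a (hypothetical) "incoming/" prefix after 7 days.
print(json.dumps(expire_after_days(7, prefix="incoming/"), indent=2))
```

For the one-time processing described above, a short expiration like this means cleanup happens without any deletion code on your side.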

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files every day (an average of 150 total uploads and 250 total downloads per day).
I'm still trying to read up on the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticeable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case (i.e. frequent uploads and downloads)? Will the General Purpose ones (T2, M4, etc.) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas EBS is a block device attached to a single instance. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you do that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. Though, you should design your system so that it doesn't really matter, and you can easily teardown/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application to newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc. So, you should be able to failover to another one on demand.

Can I improve performance of a build server by switching from EBS to EFS storage?

I am wondering if I can obtain serious performance improvements by reconfiguring our Jenkins build server to use EFS (the AWS NFS implementation) instead of EBS disks.
While EFS is about 3x more expensive per gigabyte, the real cost is probably only around 2x, because on EFS you pay only for the space you use, as opposed to EBS, where you pay for the full provisioned size.
EFS also has another very important advantage: it scales without having to take anything down for an upgrade. Resizing an EBS disk is a time-consuming operation that involves downtime.
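The "pay for used vs. pay for provisioned" point is easy to put in rough numbers. A sketch with illustrative per-GB-month prices (the $0.30 and $0.10 figures are assumptions for the comparison; check the current pricing pages):

```python
def monthly_cost(used_gb, provisioned_gb, efs_price=0.30, ebs_gp2_price=0.10):
    """Compare EFS (billed on space used) with EBS (billed on provisioned size)."""
    return {"efs": used_gb * efs_price, "ebs": provisioned_gb * ebs_gp2_price}

# 200 GB actually used, on a 600 GB EBS volume sized with headroom for growth:
print(monthly_cost(used_gb=200, provisioned_gb=600))
```

With 3x headroom on the EBS volume, the two bills come out roughly equal here, which is where the "only around 2x in practice" intuition comes from.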
This question is not about cost; it is about performance. If I can improve the build speed even by 20%, the extra storage cost would clearly be outweighed (not to mention the advantage of needing less maintenance later).
From my direct experience, this is a very bad idea for a Jenkins server. We hoped to save ourselves the administrative and automation overhead of creating, expanding, and otherwise managing EBS volumes, so we put our Jenkins home on an EFS mount.
The trouble is that Jenkins builds often involve lots of tiny files (for example, JavaScript npm modules), which are the worst-case scenario for EFS, and indeed for any NFS implementation. File-based storage requires a server round trip for each file access. In our specific case, cleaning out the workspaces of even small projects can take several minutes on a Jenkins server with its home directory on EFS.
Save yourself the trouble, learn from our mistakes; we are going to undo this choice. Your Jenkins server will almost certainly be much slower than one based on EBS.
Here are the intermediate results of my attempt to use AWS EFS for storing the Jenkins home directory (which includes the workspaces).
My mistake was missing this well-hidden page about EFS performance, which I would summarize as: unless you store a huge amount of data on EFS, it can burst for only about 0.5% of the day, where "burst" is what we would all expect as the normal performance.
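That "0.5% of the day" figure follows directly from how bursting-mode EFS credits work: credits accrue at a baseline rate of 50 KiB/s per GiB stored, and small file systems spend them at the 100 MiB/s burst rate, so in steady state you can burst only for baseline/burst of the time. A sketch of the arithmetic (rates taken from the EFS performance page linked above):

```python
def burst_minutes_per_day(stored_gib, baseline_kib_per_gib=50, burst_mib=100):
    """Minutes per day a bursting-mode EFS file system can sustain burst rate."""
    baseline_kib = stored_gib * baseline_kib_per_gib  # credits earned per second
    burst_kib = burst_mib * 1024                      # credits spent per second bursting
    fraction = baseline_kib / burst_kib
    return fraction * 24 * 60

# An 8 GiB Jenkins home earns credits for only a few minutes of burst per day:
print(f"{burst_minutes_per_day(8):.1f} min/day")  # 5.6 min/day
```

So a small file system like a Jenkins home spends nearly the whole day at its tiny baseline rate, which matches the rsync numbers below.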
It turns out EFS is not just slow, it is extremely slow: so slow that I failed to rsync even 8 GB of data from the local EBS volume to the EFS one.
root@hostname:/efs# time rsync -ah --info=progress2 /jenkins/ /efs
816.72M 6% 609.02kB/s 0:21:49 (xfr#12490, ir-chk=1009/273305)
2.71G 18% 871.55kB/s 0:50:40 (xfr#42955, ir-chk=1070/306870)
The average speed was around 1.5 MB/s, which is ridiculous.
Because of this, I decided not even to test running a Jenkins build job on it.
I checked whether AWS was throttling my speed, but the EFS monitoring does not suggest that is the case. I think this is the expected performance when you have to handle lots of small files. (EFS monitoring screenshot omitted.)

AWS EBS references

Perhaps this isn't so much a code question as a definitional question, but would someone be able to explain to me what the six line items below represent?
EBS has three types of storage (in order from most expensive to cheapest):
Provisioned IOPS. These are SSD volumes with a performance guarantee. With these volumes you pay not only for the size of the volume but also for the number of IOPS you have provisioned. They should only be used when performance is very important.
General Purpose SSD. These volumes provide improved performance over Magnetic volumes at a somewhat higher cost. Probably the best choice for most general purpose uses.
Magnetic. This type of storage uses magnetic disks and is the cheapest and slowest. Good for bulk data storage that doesn't have any performance requirement.
The other two line items not covered by the volume types above are I/O requests, which are charged whenever data blocks are read from or written to a volume, and snapshots, which are copies of volumes stored on S3.
Amazon Elastic Block Storage is offered in three flavors: Magnetic, PIOPS SSD and General Purpose SSD.
Each flavor will offer different performance and prices, that you can check in the EBS pricing page.
These line items look like a budget showing how much of each your project consumes :)

Do Amazon High I/O instance guarantee disk persistence?

High I/O instances in EC2 use SSDs. How does one run a database on such an instance while guaranteeing persistence of data?
From my limited understanding, I'm supposed to use Elastic Block Store (EBS) so that even if the machine goes down, the data on the disk doesn't disappear. On the other hand, the instance-store SSD of a High I/O instance is ephemeral and can't be used for database storage, because if, for example, the machine loses power, the data isn't preserved. Is my understanding correct?
Point 1) If your workload needs High I/O SSDs for the DB, then you should have a master-slave setup. Ideally, 1 master and 2 slaves spread across 3 AZs is suggested. Even if there is an outage in a single AZ, the other AZs can handle the load and serve your high-availability needs. Between master and slaves you can employ synchronous, semi-synchronous, or asynchronous replication, depending on your DB. This solution is costlier.
Point 2) Generally, if your DB is OLTP in nature, then Amazon EBS PIOPS + EBS-optimized instances give you consistent IOPS. A single EBS volume can provide 4,000 IOPS, and you can RAID 0 multiple volumes to gain 10k+ IOPS for performance. Lots of customers are taking this route in AWS. Even though you may use EBS for persistence, it is still recommended to go with a master-slave architecture for high availability. I have written detailed articles on this topic on my blog; refer to them for more information.
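The RAID 0 arithmetic is straightforward: striping across N volumes multiplies IOPS (and throughput) by N, at the cost that losing any single volume loses the whole array, which is another reason replication is still recommended on top. A sketch:

```python
def raid0_iops(volumes, iops_per_volume=4000):
    """Aggregate IOPS of a RAID 0 stripe of identical PIOPS volumes.
    Note: a single volume failure destroys the entire array."""
    return volumes * iops_per_volume

# Three 4,000-IOPS Provisioned IOPS volumes striped together:
print(raid0_iops(3))  # 12000
```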
It is the same as other ephemeral storage: it does not guarantee persistence. Persistence is handled by replication between instances, with at least one instance writing to an EBS volume.
If you want your data to persist, you're going to need to use EBS. Building a database on an ephemeral drive, regardless of performance, seems a dubious design choice.
EBS now offers 4,000 IOPS volumes, which, depending on your database requirements, is quite possibly more than sufficient.
My next question would really be: Do you want to host/run your own database?
Turnkey products such as RDS and DynamoDB may be sufficient for your needs. Using them is much easier than setting up and managing your own database. RDS now advertises: "You can now provision up to 3TB and 30,000 IOPS per DB Instance". That's enough database horsepower for many, many problem sets.