I have the following:
2 pod replicas, load balanced.
Each replica having 2 containers sharing network.
What I am looking for is a shared volume...
I am looking for a solution where the 2 pods and each of the containers in the pods can share a directory with read+write access. So if a one container from pod 1 writes to it, containers from pod 2 will be able to access the new data.
Is this achievable with persistent volumes and PVCs? if so what do i need and what are pointers to more details around what FS would work best, static vs dynamic, and storage class.
Can the volume be an S3 bucket?
Thank you!
There are several options depending on price and efforts needed:
Simplest but a bit more expensive solution is to use EFS + NFS Persistent Volumes. However, EFS has serious throughput limitations, read here for details.
You can create pod with NFS-server inside and again mount NFS Persistent Volumes into pods. See example here. This requires more manual work and not completely highly available. If NFS-server pod fails, then you will observe some (hopefully) short downtime before it gets recreated.
For HA configuration you can provision GlusterFS on Kubernetes. This requires the most efforts but allows for great flexibility and speed.
Although mounting S3 into pods is somehow possible using awful crutches, this solution has numerous drawbacks and overall is not production grade. For testing purposes you can do that.
Refer to https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes for all available volume backends (You need ReadWriteMany compatibility)
As you can find there AWSElasticBlockStore doesn't support it. You will need any 3rd party volume provider which supports ReadWriteMany.
UPD: Another answer https://stackoverflow.com/a/51216537/923620 suggests that AWS EFS works too.
Related
Our business requirement is to read from millions of files and process those parallelly (later index those in ES). This is a one time operation and after processing those we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made the list
EBS: The data is retained even after EC2 instance is shut down. It is accessible from a single EC2 instance from our AWS region. It will be useful if we split the data on our own and provide it to different EC2 instances. It offers redundancy and encryption security. Easy to scale. We can use it if we divide the chunks manually and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don’t have to worry about maintaining and deploying the FS
S3: Not limited to access from EC2 but S3 is not a file system
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
Problem with small files in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Considering most of the files we receive are less then 1 MB; this can cause memory issues if we go beyond a certain number. So it will not give us the performance we think it should.
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS" and surprisingly there are no clear resources on "EFS" vs "HDFS" which confuses me in understanding if they are really a substitute for each other or are complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
There are possibilities by any kind of storage but as your situation is a one time scenario you need a choice with respect to
Cost optimized
well Performed
Secure
I can not answer to all your questions but concerning your use case I consider you use reach the data from EC2 instance and if you had mentioned the producing and processing of these files and the size of each file approximately maybe I could help you better.
Considerations:
EBS has a provisioned or limited Throughput and force you to provision and remove the data after treatment. FYI: you can set retention policy of EBS volume to be deleted by EC2 termination but not on shutdown.
If you need really the fastest way and don't care about costs EBS is a good idea with a good provisioning as you are charged by their life and storage.
EFS is a NAS storage and also needs the data be removed after treatment.
HDFS is a distributed file system and is the best choice for petabyte and distributed file systems but is not used as a one shot solution, you need installation and configuration.
I propose you personally the S3 as you does not have a limited throughput and using VPC endpoint you can achieve up to 25 Gbps, alternatively you can use the S3 life cycle policies to remove your data automatically based on tags or after 1 up to 356 days or archive them if needed.
Anyone have a sound strategy for implementing NFS on AWS in such a way that it's not a SPoF (single point of failure), or at the very least, be able to recover quickly if an instance crashes?
I've read this SO post, relating to the ability to share files with multiple EC2 instances, but it doesn't answer the question of how to ensure HA with NFS on AWS, just that NFS can be used.
A lot of online assets are saying that AWS EFS is available, but it is still in preview mode and only available in the Oregon region, our primary VPC is located in N. Cali., so can't use this option.
Other online assets are saying that GlusterFS is a way to go, but after some research I just don't feel comfortable implementing this solution due to race conditions and performance concerns.
Another options is SoftNAS but I want to avoid bringing in an unknown AMI into a tightly controlled, homogeneous environment.
Which leaves NFS. NFS is what we use in our dev environment and works fine, but it's dev, so if it crashes we go get a couple beers while systems fixes the problem, but on production, this is obviously a no go.
The best solution I can come up with at this point is to create an EBS and two EC2 instances. Both instances will be updated as normal (via puppet) to maintain stack alignment (kernel, nfs libs etc), but only one instance will mount the EBS. We set up a monitor on the active NFS instance, and if it goes down, we are notified and we manually detach and attach to the backup EC2 instance. I'm thinking we also create a network interface that can also be de/re-attached so we only need to maintain a single IP in DNS.
Although I suppose we could do this automatically with keepalived, and a IAM policy that will allow the automatic detachment/re-attachment.
--UPDATE--
It looks like EBS volumes are tied to specific availability zones, so re-attaching to an instance in another AZ is impossible. The only other option I can think of is:
Create EC2 in each AZ, in public subnet (each have EIP)
Create route 53 healthcheck for TCP:2049
Create route 53 failover policies for nfs-1 (AZ1) and nfs-2 (AZ2)
The only question here is, what's the best way to keep the two NFS servers in-sync? Just cron an rsync script between them?
Or is there a best practice that I am completely missing?
There are a few options to build a highly available NFS server. Though I prefer using EFS or GlusterFS because all these solutions have their downsides.
a) DRBD
It is possible to synchronize volumes with the help of DRBD. This allows you to mirror your data. Use two EC2 instances in different availability zones for high availability. Downside: configuration and operation is complex.
b) EBS Snapshots
If a RPO of more than 30 minutes is reasonable you can use periodic EBS snapshots to be able to recover from an outage in another availability zone. This can be achieved with an Auto Scaling Group running a single EC2 instance, a user-data script and a cronjob for periodic EBS snapshots. Downside: RPO > 30 min.
c) S3 Synchronisation
It is possible to synchronize the state of an EC2 instance acting as NFS server to S3. The standby server uses S3 to stay up to date. Downside: S3 sync of lots of small files will take too long.
I recommend watching this talk from AWS re:Invent: https://youtu.be/xbuiIwEOCAs
AWS has reviewed and approved a number of SoftNAS AMIs, which are available on AWS Marketplace. The jointly published SoftNAS Architecture on AWS White Paper provides more details:
Security (pages 4-11)
HA across AZs (pages 13-14)
You can also try a 30 day free trial to see if it meets your needs.
http://softnas.com/tryaws
Full disclosure: I work for SoftNAS.
How to share S3 storage between multiple EC2 instances? I am beginner to AWS, I need to know how to share a drive between multiple EC2 instances.
Currently you can't, and S3 is your best bet, but AWS does have their Elastic File System in BETA currently, and there is the possibility it will be available for general availability anytime (I have no inside knowledge, just a guess - maybe even this week, they often have lots of announcements during their annual conference going on now).
You can signup for 'preview' access and see if it suits your needs, and then decide if you can wait for it to become fully available.
AWS EFS will allow you to share a drive between instances:
Amazon EFS supports the Network File System version 4 (NFSv4)
protocol, so the applications and tools that you use today work
seamlessly with Amazon EFS. Multiple Amazon EC2 instances can access
an Amazon EFS file system at the same time, providing a common data
source for workloads and applications running on more than one
instance.
https://aws.amazon.com/efs/
EFS (still in beta, half a year later) indeed looks like the best option. But as EFS is basically just a managed, highly available NFS server, it should be possible to roll out some other NFS solution first, and replace it with EFS once it's finally available.
One promising candidate seems dCache, which is
a system for storing and retrieving huge amounts of data, distributed
among a large number of heterogenous server nodes, under a single
virtual filesystem tree with a variety of standard access methods.
It is used by research institutions all over the world to store over 100PB of data, and it provides an NFSv4 interface. Not sure how easy setup on AWS would be, or what the performance would be like.
https://www.dcache.org/
I'm new to AWS and also to Cassandra. I just read about EBS and S3 storage available in AWS. I was trying to figure out if we have Cassandra installed in EC2, which storage would it use? EBS or S3? Or is there other storage? I'm little confused with this. Please help me understand this.
Thanks
Aravind
You shouldn't run Cassandra on EBS, as recommended per Datastax itself :
"EBS volumes are not recommended for Cassandra data volumes for the following reasons:
EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to back load reads and writes until the entire cluster becomes unresponsive.
Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing."
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html
The answer above comes from Cassandra 1.2, a relatively old version. Documentation for newer versions of Cassandra indicate that EBS Optimized instances using GP2 SSD can be used for production workloads.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
Things that changed since then were the creation of EBS Optimized instances, which reduces and/or eliminates noisy neighbor throughput problems, and using GP2 SSD for EBS storage.
If you are just getting started, I would recommend EBS Optimized. The performance should be pretty good, but you gain a critical ability -> creating snapshots. This reduces the risk of your instance becoming unstable because you would have S3-backed volume snapshots for AWS to rebuild data from if a drive died.
This reduces the need to setup your Cassandra cluster across regions. One of the concerns that you have to build around when using Ephemeral is a whole region potentially going down, which could wipe out your entire cluster if you didn't build a multi-region cluster. With EBS, this isn't really a concern.
For Cassandra you need to use EBS. S3 is an object store with and API to store and retrieve objects, but not easy querying mechanisms. The use cases include backup and archiving, Disaster Recovery, Static Website Hosting, etc
However, you can use S3 for Cassandra backup.
You can also consider ephemeral disks (as Jeff mentions) and storage which comes with AWS instance.
AWS releases new Elastic File System this week. See http://aws.amazon.com/efs/
The page doesn't contain many details. I'd like to know its performance comparing to S3, as well as other differences.
You almost can't compare EFS and S3 because they are two very different things, even though there is some overlap in their functionality, or at least their apparent functionality.
They both store things and they both have a storage pricing model that scales linearly with usage over time.
But S3 is an object store with an HTTP interface and a mixed consistency model....
...while EFS is an actual filesystem with an NFS interface and as such will almost certainly offer immediate consistency.
S3, coupled with a utility like s3fs can be used in a way that mimics a filesysem, but not to the point of behaving in all ways like an actual filesystem.
One way of looking at EFS is that it is an answer to the question, "how do I attach an EBS volume to multiple instances at the same time?" Previously, of course, the answer was, "you can't." You can mount the filesystem exposed by EFS on any nunber of instances and the result should be very similar to what you'd see if you had a "shared volume."
Its performance compared to S3 is not really a fair comparison, again, because they are different things for different purposes, but EFS will almost without question be "faster" by any meaningful definition of the word.
Also, no software should be required in order to mount an EFS filesystem on a Linux system.
As already mentioned EFS is completely different to S3.
The simplest way to look at is to look at what the underlying technology is.
S3 is an object store, meaning it is a higher layer data storage system, essentially it is a database "blob" storage, storing data in an underlying simple database as an object.
It's designed for Write once Read many access, perfect for media data like image or video particularly as it is distributed and offers a very high level of redundancy.
EFS is a Network Storage system, underlying it is a storage array (SAN) and it offers the standard protocol for multi session network file systems (NFS)
It's built on high speed SSD drives and is intended for shared storage for your ec2 instances, think file servers.
It's been a long time coming for AWS and IMO this was one of the biggest missing key components for aws to really be a competitor to on-premise enterprise data centers.
Performance for EFS will be scalable and although I have not seen the details yet I am sure it will allow for provisioned IOPS just like EBS.
EFS is also considerably (10x) more expensive than S3 at $0.30 vs $0.03. From an IOPs perspective you should see better performance from EFS as it's SSD based and doesn't have the overheard of HTTP on top as does S3. It's essential NAS as a Service.
Two addition differences between the two:
AWS S3 offers Server-Side Encryption: http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
The same is not currently offered in AWS EFS
Files stored in AWS S3 as public, are accessible via a public URL to anyone.
In AWS EFS however, in order to achieve the same, you'll need to deploy a web server that will serve your files.
Choosing between EFS and S3 is depend on your usage pattern
EFS availability and durability is same as S3
but both have different usage patterns
S3 have four common usage patterns:
static web content
host entire static websites
store data for large-scale analytics.
backup and archiving of critical data.
EFS is designed for applications thats concurrently access data from multiple EC2 instances.
simply, by having one EFS you can attach it to multiple EC2.. you can't do that with EBS.
Amazon claim that S3 performance is more than any current users needs.
EFS performance has two modes
General Purpose
Max I/O
General Purpose is the default and it's appropriate for most operations type.
but, if your workload will exceed 7000 file operations per second then Max I/O is your target