Cassandra on AWS - amazon-web-services

I'm new to AWS and also to Cassandra. I just read about the EBS and S3 storage available in AWS. I was trying to figure out: if we have Cassandra installed on EC2, which storage would it use? EBS or S3? Or is there other storage? I'm a little confused by this. Please help me understand.
Thanks
Aravind

You shouldn't run Cassandra on EBS, as DataStax itself advises:
"EBS volumes are not recommended for Cassandra data volumes for the following reasons:
EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to back load reads and writes until the entire cluster becomes unresponsive.
Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing."
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html

The answer above comes from the Cassandra 1.2 documentation, a relatively old version. Documentation for newer versions of Cassandra indicates that EBS Optimized instances using GP2 SSD can be used for production workloads.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
Things that have changed since then are the introduction of EBS Optimized instances, which reduce or eliminate noisy-neighbor throughput problems, and the use of GP2 SSD for EBS storage.
If you are just getting started, I would recommend EBS Optimized instances. The performance should be pretty good, and you gain a critical ability: creating snapshots. This reduces the risk to your data if an instance becomes unstable, because you would have S3-backed volume snapshots from which AWS can rebuild the data if a drive died.
This also reduces the need to set up your Cassandra cluster across regions. One of the concerns you have to design around when using ephemeral storage is a whole region potentially going down, which could wipe out your entire cluster if you didn't build a multi-region cluster. With EBS, this isn't really a concern.
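As a concrete illustration of that snapshot ability, here is a minimal boto3 sketch; the region and volume ID are placeholders, and for an application-consistent Cassandra backup you would typically run `nodetool snapshot` or flush first:

```python
# Minimal sketch: snapshotting an EBS data volume with boto3.
# The region and volume ID below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

resp = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # hypothetical Cassandra data volume
    Description="cassandra-data nightly snapshot",
)
snapshot_id = resp["SnapshotId"]

# Block until the snapshot completes; snapshots are stored durably in S3.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
print(f"Snapshot {snapshot_id} is complete")
```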

For Cassandra you need to use EBS. S3 is an object store with an API to store and retrieve objects, but no easy querying mechanism; its use cases include backup and archiving, disaster recovery, static website hosting, etc.
However, you can use S3 for Cassandra backups.
You can also consider ephemeral disks (as Jeff mentions), the instance storage that comes with an AWS instance.
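As a hedged sketch of what an S3 backup of Cassandra snapshot files could look like with boto3 (the bucket name is hypothetical, the data path is the common default, and a real setup would run `nodetool snapshot` first):

```python
# Sketch: upload files under Cassandra's snapshot directories to S3.
# Bucket name and data path are assumptions, not values from the answer.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cassandra-backups"           # hypothetical bucket
DATA_DIR = "/var/lib/cassandra/data"      # common default data path

for root, _dirs, files in os.walk(DATA_DIR):
    if "snapshots" not in root.split(os.sep):
        continue                          # only ship snapshot files
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, DATA_DIR)
        s3.upload_file(path, BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")
```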

Related

Accessing instance storage in AWS SageMaker notebooks

I'm trying to train a model using AWS SageMaker notebooks and am disappointed with how slowly the model is training. I think my bottleneck is the IOPS of the persistent storage (EFS and EBS) my SageMaker notebooks are accessing for the dataset.
First, I tried training on a SageMaker Studio ml.g4dn.xlarge instance, then moved everything over to a SageMaker notebook ml.g4dn.xlarge instance through Jupyter. Even though g4dn.xlarge instances come with a physically attached 125 GB SSD, I'm unable to access it because SageMaker Studio automatically creates an EFS store, and SageMaker notebook instances automatically create an EBS store. How can I store my dataset on the 125 GB SSD instead of EFS or EBS to speed up the IOPS?
It is clear that there are instance types optimized for large amounts of data. In your case, if the dataset is given as input to the model at exactly that size (so there is no upstream preprocessing to lighten it), you should know that the g4dn is EBS-optimized.
The most obvious answer I can think of is to use an S3 bucket.
From "Maximum transfer speed between Amazon EC2 and Amazon S3":
Traffic between Amazon EC2 and Amazon S3 can leverage up to 100 Gbps of bandwidth to VPC endpoints and public IPs in the same region.
Besides being fast, it is also the best solution in terms of design for all the components of your project on AWS. Clearly it entails different costs and a different architecture, but you will enjoy the maximum speed that the set of AWS services can offer (possibly with special configuration for even better performance).
My advice is to follow the AWS guidelines for developing a complex project from scratch: building, training, and deploying machine learning models.
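One hedged sketch of that approach: stage the dataset from S3 onto the instance's local disk once before training, so the training loop reads from fast local storage instead of EFS/EBS. The bucket, prefix, and local directory below are placeholders:

```python
# Sketch: copy a dataset prefix from S3 to local disk with boto3.
# BUCKET, PREFIX, and LOCAL_DIR are assumptions; adjust to your setup.
import os
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-training-data", "dataset/"  # hypothetical
LOCAL_DIR = "/tmp/dataset"                       # assumed fast local disk

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):             # skip folder markers
            continue
        dest = os.path.join(LOCAL_DIR, os.path.relpath(obj["Key"], PREFIX))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], dest)
```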

Is EFS a substitute for HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation, and after processing them we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after the EC2 instance is shut down. It is accessible from a single EC2 instance within one Availability Zone. It would be useful if we split the data ourselves and provide it to different EC2 instances. It offers redundancy and encryption. Easy to scale. We can use it if we divide the data into chunks manually and hand those to the different servers we have.
EFS: It allows us to mount the file system across multiple Availability Zones and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don't have to worry about maintaining and deploying the FS.
S3: Not limited to access from EC2, but S3 is not a file system.
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
There is also the problem with small files in Hadoop, explained at https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/. Considering most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number, so it will not give us the performance we think it should (a common mitigation is sketched below).
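A common mitigation for the small-file problem, sketched here under assumed paths, is to pack the sub-1 MB files into larger archives (roughly one HDFS block each) before distributed processing:

```python
# Sketch: batch many small files into ~128 MB tar archives.
# SRC_DIR and OUT_DIR are hypothetical; the 128 MB target is illustrative.
import os
import tarfile

SRC_DIR, OUT_DIR = "incoming", "batched"
TARGET_BYTES = 128 * 1024 * 1024  # roughly one HDFS block per archive

def write_batch(paths, index):
    with tarfile.open(os.path.join(OUT_DIR, f"batch-{index:05d}.tar"), "w") as tar:
        for p in paths:
            tar.add(p, arcname=os.path.basename(p))

os.makedirs(OUT_DIR, exist_ok=True)
batch, size, n = [], 0, 0
for name in sorted(os.listdir(SRC_DIR)):
    path = os.path.join(SRC_DIR, name)
    batch.append(path)
    size += os.path.getsize(path)
    if size >= TARGET_BYTES:
        write_batch(batch, n)
        batch, size, n = [], 0, n + 1
if batch:
    write_batch(batch, n)  # flush the final partial batch
```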
My confusion is with HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", which leaves me unsure whether they are really substitutes for each other or complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" What does it mean to have an EFS mount as an HDFS directory?
"Using EBS volumes for HDFS prevents data locality": what does it mean to use an "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any kind of storage could work, but since your situation is a one-time scenario, you need a choice that is:
cost-optimized
performant
secure
I can't answer all of your questions, but for your use case I assume you access the data from EC2 instances. If you had mentioned how these files are produced and processed, and the approximate size of each file, I could help you better.
Considerations:
EBS has provisioned (and limited) throughput and forces you to provision capacity and remove the data after processing. FYI: you can set the retention policy of an EBS volume so that it is deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, EBS is a good idea with proper provisioning, as you are charged for the volume's lifetime and storage.
EFS is NAS storage and also needs the data to be removed after processing.
HDFS is a distributed file system and is the best choice at petabyte scale, but it is not meant as a one-shot solution; it needs installation and configuration.
Personally I propose S3, as it does not have a limited throughput, and using a VPC endpoint you can achieve up to 25 Gbps. Additionally, you can use S3 lifecycle policies to remove your data automatically, based on tags or after a number of days you specify, or to archive it if needed.
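For illustration, a minimal boto3 sketch of such a lifecycle rule; the bucket name, prefix, and 7-day window are placeholders I chose, not values from the answer:

```python
# Sketch: expire processed objects automatically with an S3 lifecycle rule.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-one-time-processing",        # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-processed-files",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},  # delete 7 days after creation
            }
        ]
    },
)
```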

S3 vs. EBS Performance

What are the performance implications of using S3 vs. EBS?
Say my app needs to access files fast. Will EBS be much faster than S3? I would like to use S3 because I don't want space limitations, but I am worried about performance.
In general EBS will perform better than S3 (EBS has four variations: gp2, io1, st1, and sc1). With io1 you get to specify IOPS (S3 has no user-specified performance control). gp2 has IO bursts (up to 3,000 IOPS) and IO credits (5.4 million), which may be good for some applications. Check out http://www.slideshare.net/AmazonWebServices/deep-dive-maximizing-ec2-and-ebs-performance
Hope it is understood that the two use different access methods: S3 is an object store, EBS is block storage.
(Please also note that EBS is network-attached, not directly attached to EC2, as far as I know.)
You should consider the AWS Elastic File System as an alternative to S3 and EBS. Its speed will likely fall between S3 and EBS but it does provide ways to control performance and it is not limited in terms of size. Amazon EFS Performance details: http://docs.aws.amazon.com/efs/latest/ug/performance.html
This should help.
Regards,
Aditya Garg
EBS will always be faster than S3 for data transfer, as EBS is directly attached to the EC2 instance. But again, the storage limitation comes into the picture for EBS.
So if you want both low latency and unlimited storage, I suggest mounting an S3 bucket as a volume on your EC2 instance; it gave very good data transfer speeds in my case.
Here is a guide that is really good for achieving this:
https://fullstacknotes.com/mount-aws-s3-bucket-to-ubuntu-file-system/
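If you want to get a rough feel for the difference yourself, here is a hedged Python sketch that times a local (EBS-backed) file read against an S3 GetObject via boto3; the bucket, key, and file path are placeholders, and real numbers depend heavily on instance type, region, and object size:

```python
# Sketch: crude latency comparison of a local file read vs. an S3 read.
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "data/sample.bin"   # hypothetical object
LOCAL_PATH = "/data/sample.bin"                # same payload on EBS

t0 = time.perf_counter()
with open(LOCAL_PATH, "rb") as f:              # block storage via filesystem
    local_bytes = f.read()
t1 = time.perf_counter()

s3_bytes = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
t2 = time.perf_counter()

print(f"EBS read: {t1 - t0:.4f}s, S3 read: {t2 - t1:.4f}s")
```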

Recommended AWS storage type for Cassandra?

I need to deploy Cassandra on AWS but am confused as to what type of AWS storage is most suitable for Cassandra.
The Datastax documentation here:
http://docs.datastax.com/en/cassandra/3.0/cassandra/planning/planPlanningEC2.html
says that EBS volumes are recommended. At the same time the Datastax AMI documentation:
http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installAMI.html
says that:
Uses RAID0 ephemeral disks for data storage and commit logs.
Launches EBS-backed instances for faster start-up, not database storage.
So which one is the recommended storage type for Cassandra? The EBS storage or the Instance storage?
Many of the new EC2 instances are EBS-only (http://www.ec2instances.info/). I am not sure when the Cassandra document was written, but EBS disks have improved a lot recently and Amazon launches new types frequently, so you will be able to find what you're looking for with one of them.
You can check https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html?icmpid=docs_ec2_console; the recommended type is Provisioned IOPS SSD (io1).
To add a reason why AWS is moving to EBS and why it is good for Cassandra data: with the ephemeral type of storage, your data disappears if your instance is terminated (because of a crash or a stop you made). With EBS, even when your instance is gone you still have access to your data and can attach the volume to a new instance (also really useful when up- or down-grading instances, as sketched below).
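For completeness, a minimal boto3 sketch of that re-attachment flow; all IDs and the region below are hypothetical:

```python
# Sketch: move an EBS volume from an old instance to a replacement.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region
VOLUME_ID = "vol-0123456789abcdef0"                  # hypothetical IDs
NEW_INSTANCE = "i-0fedcba9876543210"

ec2.detach_volume(VolumeId=VOLUME_ID)                # release from old host
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])
print(f"{VOLUME_ID} now attached to {NEW_INSTANCE}")
```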
I came upon this presentation, which clearly answers the question with a very interesting use case:
https://www.youtube.com/watch?v=1R-mgOcOSd4
To summarize:
EBS has changed a lot since 2011, when major companies like Netflix had problems with it.
EBS and GP2 are now the recommended storage for Cassandra, and you should not expect any bottlenecks there.
Datastax have recently updated their documentation to also recommend EBS:
http://docs.datastax.com/en/cassandra/3.0/cassandra/planning/planPlanningEC2.html
No doubt EBS.
Memory-optimized instances are best suited for Cassandra:
T2
T2 instances are burstable-performance instances that offer a baseline level of CPU performance with the ability to burst above the baseline.
M4
M4 instances are the most recent general-purpose instances. The M4 family offers a balance of memory, network, and compute resources, and is a good option for many applications.
C4
These instances are recent additions to the compute-optimized family, featuring the highest-performing processors and the lowest price per unit of compute performance among EC2 instance types.
X1
These instances are best suited for enterprise-class, large-scale, in-memory applications and offer the lowest price per GiB of RAM among AWS EC2 instance types. The X1 instances are the latest addition to the EC2 memory-optimized instance group and are intended for running high-scale, in-memory databases and in-memory applications on the AWS cloud.
For pricing and other information, see:
https://aws.amazon.com/ec2/instance-types/

Do Amazon High I/O instances guarantee disk persistence?

The High I/O instances in EC2 use SSDs. How does one run a database on such an instance while guaranteeing persistence of data?
From my limited understanding, I'm supposed to use Elastic Block Store (EBS) so that even if the machine goes down, the data on the disk doesn't disappear. On the other hand, the instance-store SSD of a High I/O instance is ephemeral and can't be used for database storage because, for example, if the machine loses power the data image isn't preserved. Is my understanding correct?
Point 1) If your workload needs high-I/O SSDs for the DB, then you should have a master-slave setup. Ideally 1 master and 2 slaves spread across 3 AZs is suggested. Even if there is an outage in a single AZ, the other AZs can handle the load and serve your high-availability needs. Between master and slaves you can employ synchronous, semi-synchronous, or asynchronous replication, depending on your DB. This solution is costlier.
Point 2) Generally, if your DB is OLTP in nature, then Amazon EBS PIOPS + EBS-optimized instances give you consistent IOPS. A single EBS volume can provide 4,000 IOPS, and you can RAID 0 multiple volumes to gain 10k+ IOPS for performance. Lots of customers are taking this route in AWS. Even though you may use EBS for persistence, it is still recommended to go with a master-slave architecture for high availability. I have written detailed articles on this topic on my blog; refer to them for more information.
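As a small illustration of the PIOPS route, here is a hedged boto3 sketch that provisions a single io1 volume at the 4,000 IOPS figure mentioned above; the region, AZ, and size are placeholders, and RAID 0 striping across several such volumes would then be done at the OS level:

```python
# Sketch: create a Provisioned IOPS (io1) EBS volume with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
vol = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # must match the instance's AZ
    Size=500,                       # GiB, illustrative
    VolumeType="io1",
    Iops=4000,                      # provisioned IOPS for this volume
)
print("created", vol["VolumeId"])
```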
It is the same as other ephemeral storage: it does not guarantee persistence. Persistence is handled by replication between instances, with at least one instance writing to an EBS volume.
If you want your data to persist, you're going to need to use EBS. Building a database on an ephemeral drive, regardless of performance, seems a dubious design choice.
EBS now offers 4K IOPS volumes, which is, depending on your database requirements, quite possibly more than sufficient.
My next question would really be: Do you want to host/run your own database?
Turnkey products such as RDS and DynamoDB may be sufficient for your needs. Using them is much easier than setting up and managing your own database. RDS is now advertising "You can now provision up to 3TB and 30,000 IOPS per DB Instance". That's enough database horsepower for many, many problem sets.