The AWS Confluent quickstart configures Kafka log.dirs with four 512GB EBS block devices striped with RAID-0 for higher throughput, which also helps bypass the 1TB limit on block devices without provisioned IOPS. I have just learned that losing a block device in a RAID-0 group causes all other devices in that group to fail. Can someone help clarify this?
Now that Kafka allows multiple directories under log.dirs, can we mount each block device under a different mount point and configure them as a list of directories under log.dirs?
If that is possible (which I guess it is), what are the trade-offs?
A couple things to note.
First, there isn't a 1TB limit on EBS volumes. At the moment, Amazon st1 volumes can be as large as 16TB. These are the kind of volumes you want in your Kafka deployment, because they're optimized for sequential writes, which is what Kafka does best.
Secondly, yes--Kafka allows for multiple log directories. This lets you spread storage across disks so that you're not overtaxing a single disk with all of your I/O. Having multiple log directories is generally going to be better than having a single directory, especially if you're dealing with large amounts of data--but there are other factors to keep in mind when dealing with EBS. If you opt for several smaller st1 volumes rather than one monolithic st1 volume, each volume has a smaller burst bucket and a lower baseline throughput. Once you go over the baseline, you'll start consuming credits from your burst bucket--see details here. It's important to monitor your burst balance in CloudWatch to make sure it isn't being routinely depleted; when it is, your whole cluster usually slows down and your brokers' request and response queues fill up, which can lead to catastrophic failures across consumer and producer applications.
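To make the log.dirs part of the question concrete: you mount each EBS volume at its own mount point and list one directory per volume. A minimal sketch of generating that setting, assuming four hypothetical mount points and a typical server.properties location:

    # Sketch only: mount points and the server.properties path are assumptions.
    mount_points = ["/mnt/kafka-disk1", "/mnt/kafka-disk2",
                    "/mnt/kafka-disk3", "/mnt/kafka-disk4"]
    log_dirs = ",".join(m + "/kafka-logs" for m in mount_points)
    print("log.dirs=" + log_dirs)

    # Append the setting to the broker config (adjust the path to your install).
    with open("/etc/kafka/server.properties", "a") as f:
        f.write("\nlog.dirs=" + log_dirs + "\n")

Kafka will then distribute new partitions across those directories, rather than striping individual log segments across devices.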
As for RAID striping: if you stripe your EBS volumes together, all of your mounted volumes belong to the same RAID-0 group, which means that Kafka log files are spread across the devices in the group rather than residing on a single device. The consequence is that if one of those devices fails, the other devices in the group fail with it. This setup is, however, supposed to be more performant than the alternatives.
Before Kafka 1.0 there was no operational difference between a single disk failing on a broker and every disk failing on that broker--both would result in the broker going down. See discussion here.
Update: As of Kafka 1.0, a failed disk will not bring down the broker (see docs). Thanks to @RobinMoffat for pointing this out. Ultimately, with RAID-0 striping you're trading the ability to recover quickly from a failed disk for overall I/O performance. That is, with striping, all partitions on a broker with a single failed disk will need to be reassigned; without striping, only the partitions on the failed disk need to be reassigned.
Related
Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation, and after processing them we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made the following list.
EBS: The data is retained even after the EC2 instance is shut down. A volume is accessible from a single EC2 instance in our AWS region, so it would be useful only if we split the data ourselves and hand it out to different EC2 instances. It offers redundancy and encryption, and it is easy to scale. We can use it if we divide the data into chunks manually and distribute those to the different servers we have.
EFS: It allows us to mount the file system on multiple instances across Availability Zones (accessible from many EC2 instances concurrently). Since EFS is a managed service, we don't have to worry about deploying and maintaining the file system.
S3: Not limited to access from EC2, but S3 is not a file system.
HDFS: Extremely good at scale, but only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is, considering our servers are pretty secure.
HDFS also has a problem with small files, explained at https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/. Considering that most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number of files, so it will not give us the performance we think it should.
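As a rough back-of-the-envelope check on that, assuming the commonly cited ballpark of about 150 bytes of NameNode heap per namespace object (one file entry plus one block entry for each sub-1MB file):

    # Rough NameNode memory estimate for many small files; the 150-byte figure
    # is a commonly cited approximation, not an exact number.
    num_files = 10_000_000          # e.g. 10 million small files
    bytes_per_object = 150
    objects_per_file = 2            # file entry + block entry

    heap_gib = num_files * objects_per_file * bytes_per_object / 1024**3
    print(f"~{heap_gib:.1f} GiB of NameNode heap just for metadata")   # ~2.8 GiB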
My confusion is with HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", which leaves me unsure whether they are really substitutes for each other or complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have an EFS mount as an HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - what does it mean to use an "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any of these storage options could work, but since yours is a one-time scenario, you need to choose the option that is:
Cost-optimized
Performant
Secure
I cannot answer all of your questions, but for your use case I assume you will be reaching the data from EC2 instances. If you had mentioned how these files are produced and processed, and roughly how large each file is, I might be able to help you better.
Considerations:
EBS has provisioned or otherwise limited throughput, and it forces you to provision capacity and to remove the data after processing. FYI: you can set the retention policy of an EBS volume so that it is deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, EBS is a good idea with proper provisioning, as you are charged for the volumes' lifetime and storage.
EFS is NAS storage and also requires the data to be removed after processing.
HDFS is a distributed file system and the best choice at petabyte scale, but it is not intended as a one-shot solution; you need to install and configure it.
Personally, I propose S3, as it does not have a limited throughput, and using a VPC endpoint you can achieve up to 25 Gbps. Alternatively, you can use S3 lifecycle policies to remove your data automatically, based on tags or after a specified number of days, or to archive it if needed.
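For example, a lifecycle rule that expires objects a fixed number of days after upload might look roughly like this with boto3 (the bucket name, prefix, and 30-day window are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Sketch: expire everything under a prefix 30 days after creation.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-one-time-ingest-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-processed-files",
                    "Filter": {"Prefix": "raw-files/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                }
            ]
        },
    )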
I understand that "spark.deploy.spreadOut", when set to true, can benefit HDFS, but for S3 can setting it to false have a benefit over true?
If you're running Hadoop and HDFS, it would not benefit you to use the Spark Standalone scheduler, which is what that property applies to. Rather, you should be running YARN, where the ResourceManager determines how executors are spread.
If you are running the Standalone scheduler in EC2, then setting that property will help, and the default is true.
In other words, where you're reading the data from is not the deciding factor here; the deploy mode of the master is.
The bigger performance gains come from the number of files you're trying to read and the formats you store the data in.
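For instance, here is a sketch of the kind of change that usually matters more: reading a smaller number of larger Parquet files instead of many tiny text files (the bucket, path, and column name are made up, and a configured s3a connector is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read-sketch").getOrCreate()

    # Columnar formats with fewer, larger files mean fewer S3 requests and
    # allow column pruning, which typically outweighs scheduler settings.
    df = spark.read.parquet("s3a://my-bucket/events/parquet/")
    df.groupBy("event_type").count().show()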
This really depends on your workload.
If your S3 access is massive and constrained by instance network I/O, setting spark.deploy.spreadOut=true will help, because it spreads executors over more instances, increasing the total network bandwidth available to the app.
But for most workloads it will make no difference.
There is also a cost consideration for the "spark.deploy.spreadOut" parameter.
If your Spark processing is large-scale, you are likely using multiple AZs.
The default, "spark.deploy.spreadOut" = true, will cause your workers to generate more network traffic during data shuffling, including inter-AZ traffic.
Inter-AZ traffic on AWS can get costly, so if the network traffic volume is high enough, you might want to cluster apps more tightly with spark.deploy.spreadOut = false instead of spreading them out.
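As a rough illustration of the cost impact (the traffic volume and the roughly $0.01/GB-each-direction rate are assumptions; check current AWS data-transfer pricing):

    # Back-of-the-envelope inter-AZ shuffle cost; all numbers are assumptions.
    shuffle_tb_per_day = 5                      # hypothetical daily shuffle volume
    cross_az_fraction = 0.5                     # portion of shuffle crossing AZs
    rate_per_gb_each_way = 0.01                 # verify against current pricing

    gb = shuffle_tb_per_day * 1024 * cross_az_fraction
    daily_cost = gb * rate_per_gb_each_way * 2  # charged on both sides
    print(f"~${daily_cost:.0f}/day, ~${daily_cost * 30:.0f}/month")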
I am storing users' code in a file system, at present EBS in AWS. I am looking to improve availability and want to reduce the chances of an outage due to EBS going down. EFS appears to be a reasonable option.
I understand EFS will be slower than EBS and that EFS is more expensive than EBS. I want to know whether any performance benchmark has been done to measure the read and write latencies of EFS and compare them with EBS.
This AWS forums thread shows some of the problems that customers have had with EFS latency and AWS's reaction. Some customers assert they have had 1+ second latency, to which AWS support say that's not normal and that they'll investigate.
My current experience in EU-West appears to suggest that, for a series of 150,000 small read operations of about 2.5KB each, my EC2<->EFS setup is maxing out at 200 read ops per second, so we might guess at no more than 1/200th of a second, or 5ms, of typical effective latency.
I say "effective latency" because that's really reporting a bandwidth, not a latency. I haven't written timing code to measure round-trip latency.
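If you did want to measure it, a rough sketch along these lines would time individual reads (the mount point and file layout here are hypothetical):

    import os, time, statistics

    # Time small reads of files on an EFS mount; /mnt/efs/testdata is assumed
    # to contain a set of ~2.5KB files.
    test_dir = "/mnt/efs/testdata"
    latencies = []

    for name in sorted(os.listdir(test_dir))[:1000]:
        start = time.perf_counter()
        with open(os.path.join(test_dir, name), "rb") as f:
            f.read()
        latencies.append(time.perf_counter() - start)

    print(f"median: {statistics.median(latencies) * 1000:.2f} ms, "
          f"max: {max(latencies) * 1000:.2f} ms")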
You can improve it by paying for a larger file system (EFS baseline throughput scales with the amount of data stored) or for provisioned throughput.
EFS is a network file system (accessed over NFS). It provides a file system interface, file system access semantics (such as strong consistency and file locking), and concurrently-accessible storage for up to thousands of Amazon EC2 instances. Of course there will be read/write latency compared to EBS, as EBS is designed for low-latency access to data.
EBS provides different volume types, which differ in performance characteristics and price, so that you can tailor your storage performance and cost to the needs of your applications.
EFS is easy to use and offers a simple interface that allows you to create and configure file systems quickly and easily. With Amazon EFS, storage capacity is elastic, growing and shrinking automatically as you add and remove files, so your applications have the storage they need, when they need it.
Performance Overview of EFS: http://docs.aws.amazon.com/efs/latest/ug/performance.html
Performance Overview of EBS: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
I'm new to AWS and also to Cassandra. I just read about the EBS and S3 storage available in AWS. I was trying to figure out, if we have Cassandra installed on EC2, which storage it would use: EBS or S3? Or is there other storage? I'm a little confused by this. Please help me understand.
Thanks
Aravind
You shouldn't run Cassandra on EBS, as recommended by DataStax itself:
"EBS volumes are not recommended for Cassandra data volumes for the following reasons:
EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to back load reads and writes until the entire cluster becomes unresponsive.
Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing."
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html
The answer above quotes documentation for Cassandra 1.2, a relatively old version. Documentation for newer versions of Cassandra indicates that EBS-optimized instances using GP2 SSD can be used for production workloads.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
Things that have changed since then are the introduction of EBS-optimized instances, which reduce or eliminate noisy-neighbor throughput problems, and the use of GP2 SSD for EBS storage.
If you are just getting started, I would recommend EBS-optimized instances. The performance should be pretty good, and you gain a critical ability: creating snapshots. This reduces your risk, because you would have S3-backed volume snapshots for AWS to rebuild data from if a drive died.
This reduces the need to set up your Cassandra cluster across regions. One of the concerns you have to design around when using ephemeral storage is a whole region potentially going down, which could wipe out your entire cluster if you didn't build a multi-region cluster. With EBS, this isn't really a concern.
For Cassandra you need to use EBS. S3 is an object store with an API to store and retrieve objects, but no easy querying mechanisms. Its use cases include backup and archiving, disaster recovery, static website hosting, etc.
However, you can use S3 for Cassandra backup.
You can also consider ephemeral disks (as Jeff mentions), the instance-store storage that comes with an AWS instance.
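As a rough sketch of the S3 backup idea (the data path and bucket name are made up; you would create the snapshot with nodetool first, and a purpose-built backup tool may serve you better):

    import os
    import boto3

    # Upload Cassandra snapshot files for one keyspace to S3.
    snapshot_dir = "/var/lib/cassandra/data/mykeyspace"   # hypothetical path
    bucket = "my-cassandra-backups"                       # hypothetical bucket
    s3 = boto3.client("s3")

    for root, _dirs, files in os.walk(snapshot_dir):
        if "snapshots" not in root:     # only upload snapshot directories
            continue
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.relpath(local_path, "/var/lib/cassandra/data")
            s3.upload_file(local_path, bucket, key)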
The High I/O instance in EC2 uses SSD. How does one run a database on such an instance while guaranteeing persistence of data?
From my limited understanding, I'm supposed to use Elastic Block Store (EBS) so that even if the machine goes down, the data on the disk doesn't disappear. On the other hand, the instance-storage SSD of a High I/O instance is ephemeral and can't be used for database storage because, if the machine loses power for example, the data image isn't preserved. Is my understanding correct?
Point 1) If your workload needs high-I/O SSD for the DB, then you should have a master-slave setup. Ideally, 1 master and 2 slaves spread across 3 AZs is suggested. Even if there is an outage in a single AZ, the other AZs can handle the load and serve your high-availability needs. Between master and slave you can employ synchronous, semi-synchronous, or asynchronous replication, depending upon your DB. This solution is costlier.
Point 2) Generally, if your DB is OLTP in nature, then Amazon EBS PIOPS + EBS-optimized instances give you consistent IOPS. A single EBS volume can provide 4,000 IOPS, and you can RAID 0 multiple volumes to gain 10k+ IOPS. Lots of customers are taking this route in AWS. Even though you may use EBS for persistence, it is still recommended to go with a master-slave architecture for high availability. I have written detailed articles on this topic on my blog; refer to them for more information.
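A quick sanity check of that math (the volume count and per-volume IOPS are just example numbers):

    # Aggregate IOPS of a RAID 0 stripe over provisioned-IOPS EBS volumes;
    # actual per-volume limits depend on volume type and size.
    volumes = 3
    piops_per_volume = 4000
    print(f"~{volumes * piops_per_volume} IOPS across the stripe set")  # ~12000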
It is the same as other ephemeral storage: it does not guarantee persistence. Persistence is handled by replication between instances, with at least one instance writing to an EBS volume.
If you want your data to persist, you're going to need to use EBS. Building a database on an ephemeral drive, regardless of performance, seems a dubious design choice.
EBS now offers 4K IOPS volumes, which is, depending on your database requirements, quite possibly more than sufficient.
My next question would really be: Do you want to host/run your own database?
Turnkey products such as RDS and DynamoDB may be sufficient for your needs. Using them is much easier than setting up and managing your own database. RDS is now advertising "You can now provision up to 3TB and 30,000 IOPS per DB Instance". That's enough database horsepower for many, many problem sets.