Spark: can "spark.deploy.spreadOut = false" give a performance benefit on S3?

I understand that setting "spark.deploy.spreadOut" to true can benefit HDFS, but for S3, can setting it to false have a benefit over true?

If you're running Hadoop and HDFS, it would not benefit you to use the Spark Standalone scheduler, which is the scheduler this property applies to. Rather, you should be running YARN, where the ResourceManager determines how executors are spread.
If you are running the Standalone scheduler on EC2, then setting that property will help; note that the default is already true.
In other words, where you're reading the data from is not the deciding factor here; the deploy mode of the master is.
Bigger performance gains would come from the number of files you're trying to read and the formats you store the data in.

This really depends on your workload.
If your S3 access is heavy and constrained by instance network I/O, setting spark.deploy.spreadOut=true will help, because it spreads the application over more instances and so increases the total network bandwidth available to it.
For most workloads, though, it will make no difference.
There is also a cost consideration for the spark.deploy.spreadOut parameter.
If your Spark processing is large scale, you are likely using multiple AZs.
The default, spark.deploy.spreadOut=true, spreads an application across more workers, so data shuffling generates more network traffic, some of it crossing AZ boundaries.
Inter-AZ traffic on AWS can get costly.
If the traffic volume is high enough, you might want to pack apps more tightly with spark.deploy.spreadOut=false instead of spreading them, purely because of cost.
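To make the difference between the two settings concrete, here is a toy model of how a standalone master might hand out cores to one application. It is a simplified sketch, not Spark's actual scheduler code: the function name `allocate_cores` and the one-core-at-a-time granularity are assumptions made for illustration.

```python
def allocate_cores(free_cores, cores_wanted, spread_out=True):
    """Toy model of core allocation on a standalone cluster.

    free_cores   -- free cores on each worker, e.g. [8, 8, 8, 8]
    cores_wanted -- total cores the application asks for
    spread_out   -- True: round-robin across workers; False: fill one worker first
    Returns the number of cores taken from each worker.
    """
    assigned = [0] * len(free_cores)
    remaining = cores_wanted

    if spread_out:
        # Hand out one core at a time, cycling over workers that still have room.
        while remaining > 0:
            progressed = False
            for i, free in enumerate(free_cores):
                if remaining > 0 and assigned[i] < free:
                    assigned[i] += 1
                    remaining -= 1
                    progressed = True
            if not progressed:  # cluster is full
                break
    else:
        # Consolidate: exhaust each worker before moving to the next one.
        for i, free in enumerate(free_cores):
            take = min(free, remaining)
            assigned[i] = take
            remaining -= take
            if remaining == 0:
                break
    return assigned


spread = allocate_cores([8, 8, 8, 8], 8, spread_out=True)
packed = allocate_cores([8, 8, 8, 8], 8, spread_out=False)
print(spread, packed)
```

With four 8-core workers and an 8-core application, the spread-out case touches every worker (more NICs for S3 reads, but more cross-machine shuffle), while the packed case stays on one worker.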

Related

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation; after processing we won't read those files again. We want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after the EC2 instance is shut down. A volume is accessible from a single EC2 instance in our AWS region. It offers redundancy and encryption, and it is easy to scale. We can use it if we split the data into chunks manually and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don't have to worry about deploying and maintaining the FS.
S3: Not limited to access from EC2, but S3 is not a file system.
HDFS: Extremely good at scale, but only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big a concern this is, considering our servers are pretty secure.
The small-files problem in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Most of the files we receive are less than 1 MB, which can cause memory issues once we go beyond a certain number of files, so HDFS may not give us the performance we expect.
My confusion is about HDFS:
I went through a lot of resources that compare "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", so I can't tell whether the two are substitutes for each other or complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any of these storage options could work, but since this is a one-time job you need a choice that is:
Cost-optimized
Performant
Secure
I cannot answer all of your questions. For your use case, I assume you access the data from EC2 instances. If you had described how these files are produced and processed, and the approximate size of each file, I might be able to help more.
Considerations:
EBS has provisioned (and therefore limited) throughput, and it forces you to provision volumes and remove the data after processing. FYI: you can configure an EBS volume to be deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, well-provisioned EBS is a good idea, bearing in mind that you are charged for how long the volumes exist and how much storage they provision.
EFS is NAS-style storage and also needs the data to be removed after processing.
HDFS is a distributed file system and the best choice for petabyte-scale distributed storage, but it is not a one-shot solution: it needs installation and configuration.
Personally I suggest S3. It has no provisioned throughput limit, and with a VPC endpoint you can reach up to 25 Gbps. You can also use S3 lifecycle policies to delete your data automatically, based on tags or after a chosen number of days, or to archive it if needed.
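The lifecycle policy mentioned above can be sketched as a plain configuration dictionary. This is a minimal example under assumptions: the helper name `expire_after_days`, the `incoming/` prefix, and the 30-day value are all hypothetical, and the boto3 call is shown only as a comment so the snippet runs without AWS credentials.

```python
def expire_after_days(days, prefix=""):
    """Build an S3 lifecycle configuration that deletes objects under
    `prefix` once they are `days` days old."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = expire_after_days(30, prefix="incoming/")
print(config)

# Applying it would look roughly like this (bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-one-time-import-bucket",
#       LifecycleConfiguration=config,
#   )
```

Because this is a one-time job, a rule like this removes the need to remember a manual cleanup step after processing.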

Dynamodb vs Redis

We're using AWS and are considering DynamoDB or Redis for our new service.
Our service has the following characteristics:
1. Inserts/deletes occur at a rate of hundreds to thousands per minute, and this will grow later.
2. We don't need fast search; we only need to find a value by key.
3. Data must not be lost.
4. There is other data that, unlike item 1, doesn't see many inserts/deletes.
I'm worried about what happens when the Redis server goes down.
If Redis fails, our data will be lost.
That's why I'm considering Amazon DynamoDB.
Because DynamoDB is NoSQL, inserts/deletes are fast (slower than Redis, but we don't need that much speed), and the data is stored permanently.
But I'm not sure whether my thinking is right.
If I'm wrong or have missed another important point, I'd appreciate your advice.
Thanks.
There are two types of Redis deployment in the AWS ElastiCache service:
Standalone
Multi-AZ cluster
With a standalone installation it is possible to turn on persistence for a Redis instance, so the service can recover data after a reboot. But in some cases, such as degradation of the underlying hardware, AWS can migrate Redis to another instance and the persistent log is lost.
In a Multi-AZ cluster installation it is not possible to enable persistence; only replication occurs. In case of failure it takes some time to promote a replica to master. Another way is to use the master and replica endpoints in the application directly, which is complicated. If a failure causes both Redis nodes to restart at the same time, it is possible to lose all data in the cluster configuration too.
So, in general, Redis doesn't provide high durability for the data, while it does give you very good performance.
DynamoDB is highly available and durable storage for your data. Internally it replicates data across several availability zones, so it is highly available by default. It is also a fully managed AWS service, so you don't need to care about clusters, nodes, monitoring, etc., which is considered the right cloud way.
DynamoDB charges per read/write operation (on-demand or provisioned capacity model) and for the amount of stored data. It may be really cheap for testing the service, but much more expensive under heavy load. You should carefully analyze your workload and calculate the total service cost.
As for performance: DynamoDB is an SSD-backed database, whereas Redis is an in-memory store, but it is possible to use DAX, an in-memory read cache for DynamoDB, as an accelerator under heavy load. So you won't be strictly limited by DynamoDB performance.
Here is the link to DynamoDB pricing, which is one of the most complicated parts of using the service: https://aws.amazon.com/dynamodb/pricing/
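As a starting point for the workload analysis, the write rate from the question can be converted into DynamoDB write capacity. The sketch below relies on the documented rule that one write capacity unit covers one standard write per second for an item of up to 1 KB, and that deletes consume write capacity as well. The item sizes are hypothetical, since the question does not state them, and prices are deliberately left out; take those from the pricing page.

```python
import math


def write_capacity_units(writes_per_minute, item_size_bytes):
    """Estimate the steady-state write capacity units (WCUs) needed.

    One WCU = one standard write per second for an item up to 1 KB;
    larger items consume one extra unit per additional (partial) KB.
    """
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(writes_per_minute * units_per_write / 60)


# Hypothetical workloads: the question gives rates but not item sizes.
low = write_capacity_units(writes_per_minute=600, item_size_bytes=512)
high = write_capacity_units(writes_per_minute=6000, item_size_bytes=2500)
print(low, high)
```

This only covers the average rate; bursty traffic needs headroom on top, or the on-demand model.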

Capacity planning on AWS

I need some understanding on how to do capacity planning for AWS and what kind of infrastructure components to use. I am taking the below example.
I need to set up a Node.js-based server that uses Kafka, Redis, and MongoDB. There will be 250 devices connecting to the server and sending data every 10 seconds. The size of each data packet will be approximately 10 KB. I will be using the 64-bit Ubuntu image.
What I need to estimate:
MongoDB requires at least 3 servers for redundancy. How do I estimate the size of the VM and the EBS volume required, e.g. should it be m4.large, m4.xlarge, or something else? The default EBS volume size is 30 GB.
What should be the size of the VM for running the other application components, which include 3-4 Node.js processes, Kafka, and Redis, e.g. m4.large, m4.xlarge, or something else?
Can I keep just one application server in an Auto Scaling group and add more as the load increases, or should I go with a minimum of 2?
More generally, given the number of devices, the packet size, and the data frequency, how do we go about estimating which VM and how much storage to consider, and are there any other considerations?
Nobody can answer this question for you. It all depends on your application and usage patterns.
The only way to correctly answer this question is to deploy some infrastructure and simulate standard usage while measuring the performance of the systems (throughput, latency, disk access, memory, CPU load, etc).
Then, modify the infrastructure (add/remove instances, change instance types, etc) and measure again.
You should certainly run a minimal deployment per your requirements (e.g. instances in separate Availability Zones for High Availability), and you can use Auto Scaling to add extra capacity when required, but simulated testing would also be required to determine the right trigger points at which more capacity should be added. For example, the best indicator might be memory, or CPU, or latency. It all depends on the application and how it behaves under load.
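Measurement is still the only way to size the instances, but the raw ingest figures from the question give a baseline for the load test and for storage. The sketch below is plain arithmetic on the stated numbers (250 devices, one 10 KB packet every 10 seconds). It assumes 1 GB = 1,000,000 KB and that each of the 3 MongoDB servers stores a full copy, and it ignores indexes, compression, and Kafka retention.

```python
def estimate_ingest(devices, interval_s, packet_kb, replicas=1):
    """Rough ingest estimate from device count, send interval and packet size."""
    messages_per_second = devices / interval_s
    kb_per_second = messages_per_second * packet_kb
    gb_per_day = kb_per_second * 86_400 / 1_000_000  # seconds per day, KB -> GB
    return {
        "messages_per_second": messages_per_second,
        "kb_per_second": kb_per_second,
        "gb_per_day_per_copy": gb_per_day,
        "gb_per_day_all_copies": gb_per_day * replicas,
    }


estimate = estimate_ingest(devices=250, interval_s=10, packet_kb=10, replicas=3)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")

# Days until a 30 GB volume holding one copy would fill up with raw data.
print(f"days_to_fill_30gb: {30 / estimate['gb_per_day_per_copy']:.1f}")
```

That works out to 25 messages per second and about 21.6 GB of raw data per day per copy, so the default 30 GB volume would fill in well under two days unless old data is expired.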

Kafka on AWS with Raid-0 striping

The AWS Confluent quickstart configures Kafka log.dirs on four 512 GB EBS block devices with RAID-0 striping, for higher throughput and to bypass the 1 TB limit on block devices without provisioned IOPS. I have just learned that losing one block device in a RAID-0 group will cause all other devices in that group to fail. Can someone help clarify this?
Now that Kafka allows multiple directories under log.dirs, can we mount each block device under a different mount point and configure them as a list of directories under log.dirs?
If that is possible (which it is, I guess), what are the trade-offs?
A couple of things to note.
First, there isn't a 1 TB limit on EBS volumes. At the moment, Amazon st1 volumes can be as big as 16 TB. These are the kind of volumes you want to use in your Kafka deployment because they're optimized for sequential writes, which is what Kafka does best.
Secondly, yes: Kafka allows multiple log directories. This lets you spread storage across disks so that you're not overtaxing a single disk with all of your I/O. Multiple log directories will generally be better than a single directory, especially if you're dealing with large amounts of data, but there are other factors to keep in mind when dealing with EBS. If you opt for several smaller st1 volumes rather than one monolithic st1 volume, each volume has a smaller burst bucket and a lower baseline throughput. Once you go over the baseline, you start drawing credits from the bucket (see details here). It's important to monitor your burst balance in CloudWatch to make sure it's not being routinely depleted, which usually results in your whole cluster slowing down and your brokers' request and response queues filling up, which could lead to catastrophic failures across consumer and producer applications.
As for RAID striping: if you enable it on each of your EBS volumes, all of your mounted volumes will be in the same RAID group, so Kafka log files are spread across the devices in the group rather than residing on a single device. The consequence is that if one of those devices fails, the whole group fails with it. In exchange, striping is supposed to be more performant than other setups.
Before Kafka 1.0 there was no operational difference between a single disk failing on a broker and every disk failing on that broker: both would result in the broker going down. See discussion here.
Update: As of Kafka 1.0, a failed disk will not bring down the broker (see docs). Thanks to @RobinMoffat for pointing this out. Ultimately, with RAID-0 striping you're trading the ability to recover quickly from a failed disk for overall I/O performance. That is, with striping, all partitions on a broker with a single failed disk need to be reassigned; without striping, only the partitions on the failed disk do.
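The recovery trade-off in the last paragraph can be illustrated with a small model. This is a sketch under assumptions, not Kafka code: the function name is made up, and partitions are placed on disks round-robin purely for illustration (a real broker chooses the log directory itself).

```python
def partitions_to_reassign(num_partitions, num_disks, failed_disk, striped):
    """Return the partitions on one broker that are lost when a disk fails.

    striped=True  -- RAID-0: every partition has blocks on every disk,
                     so losing one disk loses all of them.
    striped=False -- one log dir per disk: only partitions stored on the
                     failed disk are lost (round-robin placement assumed).
    """
    if striped:
        return list(range(num_partitions))
    return [p for p in range(num_partitions) if p % num_disks == failed_disk]


with_raid0 = partitions_to_reassign(12, num_disks=4, failed_disk=1, striped=True)
with_log_dirs = partitions_to_reassign(12, num_disks=4, failed_disk=1, striped=False)
print(len(with_raid0), with_log_dirs)
```

With 12 partitions on 4 disks, a single disk failure costs all 12 partitions under RAID-0 but only 3 when each disk is its own entry in log.dirs.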

Which aws instance type is optimal to improve spark shuffle performance?

For my spark application I'm trying to determine whether I should be using 10 r3.8xlarge or 40 r3.2xlarge. I'm mostly concerned with shuffle performance of the application.
If I go with r3.8xlarge I will need to configure 4 worker instances per machine to keep the JVM size down. The worker instances will likely contend with each other for network and disk I/O if they are on the same machine. If I go with 40 r3.2xlarge I will be able to allocate a single worker instance per box, allowing each worker instance to have its own dedicated network and disk I/O.
Since shuffle performance is heavily impacted by disk and network throughput, it seems like going with 40 r3.2xlarge would be the better configuration between the two. Is my analysis correct? Are there other tradeoffs that I'm not taking into account? Does spark bypass the network transfer and read straight from local disk if worker instances are on the same machine?
It seems you have the answer already: going with 40 r3.2xlarge would be the better configuration of the two.
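As a rough sanity check on the comparison, the sketch below totals the resources of both fleets. The vCPU and memory figures are quoted from memory of the AWS R3 instance table, not from the question, so treat them as assumptions and verify them before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    instance_type: str
    instances: int
    workers_per_instance: int
    vcpus_per_instance: int       # assumption: check the AWS instance table
    memory_gib_per_instance: int  # assumption: check the AWS instance table

    def totals(self):
        """(total vCPUs, total memory in GiB) across the fleet."""
        return (self.instances * self.vcpus_per_instance,
                self.instances * self.memory_gib_per_instance)

    def per_worker(self):
        """(vCPUs, memory in GiB) available to each worker instance."""
        return (self.vcpus_per_instance // self.workers_per_instance,
                self.memory_gib_per_instance // self.workers_per_instance)


big = Fleet("r3.8xlarge", 10, workers_per_instance=4,
            vcpus_per_instance=32, memory_gib_per_instance=244)
small = Fleet("r3.2xlarge", 40, workers_per_instance=1,
              vcpus_per_instance=8, memory_gib_per_instance=61)

for fleet in (big, small):
    print(fleet.instance_type, fleet.totals(), fleet.per_worker(),
          "workers sharing one NIC:", fleet.workers_per_instance)
```

Under these assumed specs the two fleets have the same totals and the same per-worker share of CPU and memory, so the remaining difference is how many workers share one machine's network and disks, which is the contention the question describes.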
I recommend going through the AWS Well-Architected Framework.
General Design Principles
The Well-Architected Framework identifies a set of general design principles to facilitate good design in the cloud:
Stop guessing your capacity needs: Eliminate guessing your infrastructure capacity needs. When you make a capacity decision before you deploy a system, you might end up sitting on expensive idle resources or dealing with the performance implications of limited capacity. With cloud computing, these problems can go away. You can use as much or as little capacity as you need, and scale up and down automatically.
Test systems at production scale: In a traditional, non-cloud environment, it is usually cost-prohibitive to create a duplicate environment solely for testing. Consequently, most test environments are not tested at live levels of production demand. In the cloud, you can create a duplicate environment on demand, complete your testing, and then decommission the resources. Because you only pay for the test environment when it is running, you can simulate your live environment for a fraction of the cost of testing on premises.
refer:
AWS Well-Architected Framework