I have set up an AWS EMR cluster with Spark 1.4, with one master node and two slave nodes. Looking at the load distribution, it seems like one slave is always maxed out while the other one is not doing much. Has anyone faced a similar issue? What might be causing this?
Note: I am trying to run Spark MLlib to generate recommendations, so the job pulls data from Elasticsearch and does the recommendation computation in Spark. One slave is always maxed out on network usage while the other seems to be using minimal resources and is almost idle. The master is using 10 GB of network while each slave is using 1 GB.
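For context, the read side of the job looks roughly like this (a minimal sketch, assuming the elasticsearch-hadoop connector is on the Spark classpath; the host, index name, and app name are placeholders, not the actual values):

```python
# Minimal sketch of the setup described above: Spark 1.4 reading from
# Elasticsearch for an MLlib recommendation job. Host and index names
# are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="es-recommendations")
sqlContext = SQLContext(sc)

# The connector creates one Spark partition per Elasticsearch shard, so a
# low shard count can leave most of the read (and its network traffic) on
# a single executor.
ratings = (sqlContext.read
           .format("org.elasticsearch.spark.sql")
           .option("es.nodes", "es-host:9200")   # Elasticsearch endpoint (placeholder)
           .load("ratings/events"))              # index/type (placeholder)

print(ratings.rdd.getNumPartitions())  # quick check of how the read is split up
```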
Related
We are running an EMR cluster with spot instances as task nodes. The EMR cluster is executing Spark jobs which sometimes run for several hours. Interruption of spot instances can cause the failure of a Spark job, which then requires us to restart the job entirely.
I can see that there is some basic information on the "Frequency of interruption" on AWS Spot Advisor. However, this data seems very generic: I can't see historical trends, and it also lacks the probability of interruption based on how long the spot instance has been running (which should have a significant impact on the probability of interruption).
Is this data available somewhere? Or are there other data points that can be used as proxy?
I found this GitHub issue, which provides a link to this JSON file in the Spot Advisor S3 bucket that includes interruption rates:
https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json
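A quick way to pull the per-instance-type interruption buckets out of that file (a minimal sketch; the field names `spot_advisor`, `ranges`, `r`, and `s`, and the region/OS keys, reflect the file's layout at the time of writing and may change):

```python
# Fetch the Spot Advisor data file and print the interruption-frequency
# bucket for each Linux instance type in one region. Field names reflect
# the file's layout at the time of writing and may change.
import json
import urllib.request

URL = "https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# "ranges" maps a bucket index to a human-readable label such as "<5%".
labels = {r["index"]: r["label"] for r in data["ranges"]}

for instance_type, stats in data["spot_advisor"]["us-east-1"]["Linux"].items():
    # "r" is the interruption-frequency bucket, "s" the typical savings (%).
    print(instance_type, labels[stats["r"]], "savings:", stats["s"], "%")
```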
I create a Kubernetes (v1.6.1) cluster on AWS with one master and two slave nodes, then spin up a MySQL instance using Helm and deploy a simple Django web app that queries the latest five rows from the database and displays them. For my web service I specify 'type: LoadBalancer', which creates an ELB on AWS.
If I use 'weave' networking and scale my web app to at least two replicas, I begin experiencing inconsistent response times - most of the time they are reasonable (around 0.1-0.2 s), but 20-40% of requests take significantly longer (3-5 s, sometimes even more than 15 s). However, if I switch to 'flannel' networking, everything works fast, even with 20-30 replicas of the web app. All machines have enough resources, so that's not the problem.
I tried debugging to find out what's causing the delay, and the best explanation I have is that AWS ELB doesn't work well with 'weave'. Has anyone experienced similar issues? What could be the problem? Please let me know if I should provide some relevant information.
P.S. I'm new to using Kubernetes.
I have an application hosted in AWS whose data is stored in a Riak KV cluster; the cluster consists of 5 nodes.
To meet high demand and availability constraints, I would like to replicate the complete setup in another AWS region (as active-active), where another Riak KV cluster will be created with up to 5 nodes.
Now the question is: how do I sync the data between these two Riak clusters running in two different AWS regions?
Since the open-source/commercial version of Riak KV does not provide multi-region clustering capability, how do I sync data between these clusters?
The Enterprise version of Riak KV has multi-cluster/datacenter replication built in (as you note in your question). This form of replication does some pretty clever things to ensure that data copied to both clusters remains in sync when updated, as well as recovering from things like data center failure and split-brain conditions.
If you want to roll your own replication there are quite a few ways that you might approach it including:
Dual write - have your application send writes to both clusters in parallel (a minimal sketch follows this list);
Post-commit hooks (http://docs.basho.com/riak/kv/latest/developing/usage/commit-hooks/) - after data gets written to one cluster successfully, use a post-commit hook to replicate the write to the other cluster.
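A dual write can be as simple as the following (a minimal sketch using the official riak Python client; node addresses, bucket, and key are placeholders, and real code would need queueing/retries for the failure cases noted below):

```python
# Minimal dual-write sketch using the riak Python client.
# Node addresses, bucket, and key are placeholders; production code would
# queue and retry writes that fail on one side.
import riak

us_east = riak.RiakClient(nodes=[{"host": "10.0.1.10", "pb_port": 8087}])
eu_west = riak.RiakClient(nodes=[{"host": "10.1.1.10", "pb_port": 8087}])

def dual_write(bucket_name, key, value):
    """Write the same object to both clusters; report which sides succeeded."""
    results = {}
    for name, client in (("us-east", us_east), ("eu-west", eu_west)):
        try:
            obj = client.bucket(bucket_name).new(key, data=value)
            obj.store()
            results[name] = True
        except Exception:
            # A failed side must be reconciled later (retry queue, read repair, etc.)
            results[name] = False
    return results

dual_write("users", "user:1001", {"name": "Ada", "plan": "pro"})
```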
The primary weakness of these solutions is that you still need to figure out how to keep data in sync across the clusters under failure conditions.
I know that there are more than a handful of Riak KV open source users who have rolled various in-house replication mechanisms, so hopefully one of them will chime in with what they have done.
I am reading S3 buckets with Drill and writing the data back to S3 as Parquet in order to read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least 2 core machines.
Will using micro instances for the master and core nodes affect performance?
I don't make use of HDFS as such, so I am thinking of making them micro instances to save money.
All computation will be done in memory by R3.xlarge spot instances as task nodes anyway.
And finally, does Spark utilise multiple cores in each machine? Or is it better to launch a fleet of R3.xlarge task nodes on EMR 4.1 so they can be resized automatically?
I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage:
You can set the number of cores to use for the driver process, only in cluster mode. It's 1 by default.
You can also set the number of cores to use on each executor. For YARN and standalone mode only. It's 1 in YARN mode, and all the available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
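For reference, these two settings look like this when building an application (a minimal sketch; the core counts are arbitrary example values, not recommendations):

```python
# Minimal sketch of the two core-related settings mentioned above.
# The numbers are arbitrary example values.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("core-usage-example")
        .set("spark.driver.cores", "2")     # driver cores (cluster mode only)
        .set("spark.executor.cores", "4"))  # cores per executor (YARN / standalone)

sc = SparkContext(conf=conf)
print(sc.defaultParallelism)  # rough view of how many tasks can run at once
```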
Now to answer both of your questions :
will using micro instances for the master and core nodes affect performance?
Yes, the driver needs a minimum of resources to schedule jobs, collect data sometimes, etc. Performance-wise, you'll need to benchmark according to your use case to find what suits your usage better, which you can do using Ganglia on AWS, for example.
does Spark utilise multiple cores in each machine?
Yes, Spark uses multiple cores on each machine.
You can also read this question concerning which instance type is preferred for an AWS EMR cluster running Spark.
Spark support on AWS EMR is fairly new, but it's usually close to all other Spark cluster setups.
I advise you to read the AWS EMR Developer Guide (the "Plan EMR Instances" chapter) along with the official Spark documentation.
I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if this instance reaches 50% utilization, I want the Auto Scaling group to start a new instance. Is this possible?
Or, because Elastic MapReduce is "elastic", does it automatically start more instances when it needs them, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful. Nodes hold HDFS data and intermediate outputs. Deleting nodes based on cpu/memory just doesn't work. Adding nodes needs sophistication - this isn't a web site. One needs to look at the sizes of jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
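For example, resizing a task instance group from the API looks roughly like this (a minimal sketch using boto3; the cluster ID and target count are placeholders):

```python
# Minimal sketch of resizing an EMR task instance group via the API.
# The cluster ID and target instance count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster_id = "j-XXXXXXXXXXXXX"  # placeholder

# Find the TASK instance group (task nodes hold no HDFS data, so they are
# safe to shrink as well as grow).
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Request the new size; EMR adds or removes task nodes to match it.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,  # desired number of task nodes (placeholder)
    }],
)
```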
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
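If you do want to drive resize decisions from a metric, EMR publishes cluster metrics to CloudWatch. Here is a sketch of reading a pending-work metric (this assumes a YARN-based cluster that emits `AppsPending` in the `AWS/ElasticMapReduce` namespace; the cluster ID is a placeholder):

```python
# Sketch of polling a "work waiting" metric for an EMR cluster from
# CloudWatch. Assumes a YARN-based cluster that publishes AppsPending in
# the AWS/ElasticMapReduce namespace; the cluster ID is a placeholder.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="AppsPending",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder
    StartTime=now - datetime.timedelta(minutes=15),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

pending = [p["Average"] for p in resp["Datapoints"]]
print("Pending applications over the last 15 minutes:", pending)
```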
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?