EMR 4.2 Spot Prices on Core Nodes

Now that EMR supports downsizing of core nodes, suppose I create an EMR cluster with one of the core nodes as a Spot Instance. What happens when the Spot price exceeds the bid price for my core node? Will EMR gracefully decommission that core node?
Here is Amazon's description of the process of shrinking the number of core nodes:
On core nodes, both YARN NodeManager and HDFS DataNode daemons must be
decommissioned in order for the instance group to shrink. For YARN,
graceful shrink ensures that a node marked for decommissioning is only
transitioned to the DECOMMISSIONED state if there are no pending or
incomplete containers or applications. The decommissioning finishes
immediately if there are no running containers on the node at the
beginning of decommissioning.
For HDFS, graceful shrink ensures that the target capacity of HDFS is
large enough to fit all existing blocks. If the target capacity is not
large enough, only a partial amount of core instances are
decommissioned such that the remaining nodes can handle the current
data residing in HDFS. You should ensure additional HDFS capacity to
allow further decommissioning. You should also try to minimize write
I/O before attempting to shrink instance groups as that may delay the
completion of the resize operation.
Another limit is the default replication factor, dfs.replication,
in /etc/hadoop/conf/hdfs-site.xml. Amazon EMR configures the value
based on the number of instances in the cluster: 1 with 1-3 instances,
2 for clusters with 4-9 instances, and 3 for clusters with 10+
instances. Graceful shrink does not allow you to shrink core nodes
below the HDFS replication factor; this is to prevent HDFS from being
unable to close files due to insufficient replicas. To circumvent this
limit, you must lower the replication factor and restart the NameNode
daemon.
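The sizing rule quoted above can be written down as a small helper. This is only a sketch of the defaults described in the passage (the function names are mine, and a cluster configuration can override the values):

```python
def default_dfs_replication(num_instances):
    """Default dfs.replication that EMR picks for a cluster of this size,
    per the passage above: 1 for 1-3 instances, 2 for 4-9, 3 for 10+."""
    if num_instances <= 3:
        return 1
    if num_instances <= 9:
        return 2
    return 3

def min_core_nodes_after_shrink(num_instances):
    # Graceful shrink will not take the core group below the replication
    # factor, since HDFS could no longer close files with enough replicas.
    return default_dfs_replication(num_instances)
```

So a 10-instance cluster with the default settings cannot gracefully shrink its core group below 3 nodes without first lowering dfs.replication and restarting the NameNode.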

I think it might not be possible to gracefully decommission the node in the case of a Spot price spike (the general case with N core nodes). A two-minute notification is available before the Spot Instance is reclaimed due to a price spike, but even if it is captured, that window may not be enough to guarantee that the node's HDFS data is decommissioned.
Also, with only one core node in the cluster, decommissioning does not make much sense: the data held on that node needs to be moved to other core nodes, and there are none. Once the only core node is lost, there needs to be a way to bring one back, or the cluster cannot run any tasks.
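The two-minute warning mentioned above is exposed through the EC2 instance metadata service. A minimal polling sketch (the helper names are mine; the metadata call only returns a meaningful answer when run on an EC2 Spot Instance, so off-instance it simply reports no notice):

```python
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def interpret_status(status_code):
    # 200 means a termination time has been set (the ~2-minute warning);
    # 404 means no interruption notice has been issued yet.
    return status_code == 200

def termination_notice_pending(url=METADATA_URL, timeout=1):
    """Poll the instance metadata endpoint once; call this in a loop
    (e.g. every 5 seconds) to catch the warning early."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_status(resp.status)
    except urllib.error.HTTPError as err:
        return interpret_status(err.code)
    except urllib.error.URLError:
        return False  # not on EC2, or metadata service unreachable
```

Even with such a watcher in place, two minutes is rarely enough to re-replicate a node's worth of HDFS blocks, which is the crux of the answer above.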
Shameless plug :) : I work for Qubole!
The following two blog posts might be useful on integrating Spot Instances with Hadoop clusters, including dealing with Spot price spikes.
https://www.qubole.com/blog/product/riding-the-spotted-elephant
https://www.qubole.com/blog/product/rebalancing-hadoop-higher-spot-utilization

Related

How to autoscale a Cassandra cluster on AWS, or how to add nodes to the cluster dynamically depending on the load

I have been trying to auto-scale a 3-node Cassandra cluster with a replication factor of 3 and consistency level of ONE on Amazon EC2 instances.
What steps do I need to perform to add/remove nodes to the cluster dynamically based on the load on the application?
Unfortunately, scaling up and down in response to the current load is not straightforward, and if your cluster holds a large amount of data, it will not be quick either:
You can't add multiple nodes simultaneously to a cluster; all the operations need to be sequential.
Adding or removing a node requires streaming data into or out of that node. How long this takes depends on the size of your data as well as the EC2 instance type you are using (for the network bandwidth limit); there will also be differences depending on whether you are using instance storage or EBS (EBS will limit you in IOPS).
You mentioned that you are using AWS and a replication factor of 3; are you also using different Availability Zones (AZs)? If you are, the Ec2Snitch will work to keep the data balanced between them for resiliency, so when you scale up and down you will need to maintain an even distribution between AZs.
The scale operations cause a rearrangement in the distribution of tokens. Once that is completed you will need to run a cleanup (nodetool cleanup) to remove data that is no longer owned by the node, and this operation also takes time. This is important to keep in mind if you are scaling up because you are running out of space.
For our use case, we are getting good results with a proactive approach: we have set up an aggressive alerting/monitoring strategy for early detection, so we can start the scale-up operations before there is any performance impact. If your application or use case has a predictable usage pattern, that can also help you take action in preparation for periods of high workload.
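To see why a proactive approach is needed, a back-of-the-envelope estimate of streaming time helps. The numbers below are hypothetical, and real streaming is further throttled by Cassandra's stream-throughput settings and disk IOPS, so treat this as a lower bound on the lead time you need before a traffic peak:

```python
def streaming_hours(data_gb, bandwidth_gbit_per_s, utilization=0.5):
    """Rough time to stream `data_gb` of data onto (or off) one node over
    a link of `bandwidth_gbit_per_s`, assuming only a fraction
    `utilization` of the link is usable for streaming."""
    seconds = (data_gb * 8) / (bandwidth_gbit_per_s * utilization)
    return seconds / 3600.0

# e.g. 900 GB per node on a 1 Gbit/s link at 50% usable bandwidth
hours = streaming_hours(900, 1.0, 0.5)  # about 4 hours, before cleanup
```

With nodes added one at a time, plus nodetool cleanup afterwards, reacting to a load spike that is already underway is simply too late.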

Unhealthy node on the cluster

What are all reasons due to which a node on cluster goes in unhealthy state?
Based on my limited understanding, it generally happens when the HDFS utilization on the given node goes beyond a threshold value. This threshold is defined with the max-disk-utilization-per-disk-percentage property.
I have observed that nodes sometimes go into the unhealthy state when a memory-intensive Spark job is triggered via spark-sql or pyspark. Looking further, I ssh'd into the node that was unhealthy and discovered that DFS utilization was actually less than 75%, while the value set for the above-mentioned property was 99 on my cluster.
So I presume there is some other factor that I am missing which causes this behavior.
Thanks in advance for your help.
Manish Mehra
The YARN NodeManager on each Hadoop (slave) node marks the node unhealthy based on heuristics determined by a health checker. By default this is the disk checker; if configured, it can also be an external health-checker script.
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#Health_checker_service
The default disk checker checks the free disk space on the node, and if the disk(s) go over 90% utilization it marks the node unhealthy. (This default is set in yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage.)
In your case, you seem to be checking HDFS usage, which spans across nodes. You need to verify the disk utilization on the individual nodes, using "df -h" on each node. If you see a volume like /mnt/ going over 99% (the threshold set on your cluster), the node will be marked unhealthy.
You will then need to find the top directories occupying the most disk space and take appropriate action. HDFS, which uses the disks on the nodes (set via dfs.data.dir), can cause nodes to become unhealthy if its utilization is very high during a job run; however, nodes can also go unhealthy without high HDFS utilization.
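The per-volume check the YARN disk checker performs can be sketched as below. shutil.disk_usage reports per-mount numbers like df does; the 90% default can be overridden as described above, and the function names here are my own:

```python
import shutil

# yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
YARN_DEFAULT_THRESHOLD = 90.0

def percent_used(total_bytes, used_bytes):
    """Disk utilization as a percentage, as in the Use% column of df."""
    return 100.0 * used_bytes / total_bytes

def volume_unhealthy(path, threshold=YARN_DEFAULT_THRESHOLD):
    # Checks one mount point the way the disk checker does: the node is
    # flagged when any monitored volume exceeds the threshold.
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) > threshold
```

Note that this looks at everything on the volume (logs, shuffle spill, container scratch space), not just HDFS blocks, which is why a node can go unhealthy even when DFS utilization looks low.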

Updating an AWS ECS Service

I have a service running on AWS EC2 Container Service (ECS). My setup is a relatively simple one. It operates with a single task definition and the following details:
Desired capacity set at 2
Minimum healthy set at 50%
Maximum available set at 200%
Tasks run with 80% CPU and memory reservations
Initially, I am able to get the necessary EC2 instances registered to the cluster that holds the service without a problem. The associated task then starts running on the two instances. As expected – given the CPU and memory reservations – the tasks take up almost the entirety of the EC2 instances' resources.
Sometimes, I want the task to use a new version of the application it is running. In order to make this happen, I create a revision of the task, de-register the previous revision, and then update the service. Note that I have set the minimum healthy percentage to require 2 * 0.50 = 1 instance running at all times and the maximum healthy percentage to permit up to 2 * 2.00 = 4 instances running.
Accordingly, I expected 1 of the de-registered task instances to be drained and taken offline so that 1 instance of the new revision of the task could be brought online. Then the process would repeat itself, bringing the deployment to a successful state.
Unfortunately, the cluster does nothing. In the events log, it tells me that it cannot place the new tasks, even though the process I have described above would permit it to do so.
How can I get the cluster to perform the behavior that I am expecting? I have only been able to get it to do so when I manually register another EC2 instance to the cluster and then tear it down after the update is complete (which is not desirable).
I faced the same issue, where tasks would get stuck with no space to place them. The following snippet from the AWS documentation on updating a service helped me decide what to do.
If your service has a desired number of four tasks and a maximum
percent value of 200%, the scheduler may start four new tasks before
stopping the four older tasks (provided that the cluster resources
required to do this are available). The default value for maximum
percent is 200%.
The cluster needs to have enough free resources (container instances) for the new tasks to start, so that they can come up and the older ones can drain.
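In numbers, the deployment bounds from the question work out as below. This is only a sketch of the arithmetic (the function names are mine); the key point is that these are bounds on task counts, not a guarantee that the cluster has instances with room to place them:

```python
import math

def min_healthy_tasks(desired, minimum_healthy_percent):
    # Tasks ECS must keep RUNNING during a deployment.
    return math.floor(desired * minimum_healthy_percent / 100)

def max_total_tasks(desired, maximum_percent):
    # Upper bound on RUNNING + PENDING tasks during a deployment.
    return math.floor(desired * maximum_percent / 100)

# Question's setup: desired 2, minimum healthy 50%, maximum 200%
low = min_healthy_tasks(2, 50)    # 1 task must stay up
high = max_total_tasks(2, 200)    # up to 4 tasks may run
```

With tasks reserving 80% of CPU and memory per instance, the scheduler is allowed to go up to 4 tasks but has nowhere to put the extra ones, which is exactly the stuck state described.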
These are the things I do:
Before doing a service update, add around 20% capacity to your cluster. You can use the Auto Scaling group (ASG) command line to raise the desired capacity by 20%. This way you will have some additional instances during the deployment.
Once you have the instances, the new tasks will start spinning up quickly and the older ones will start draining.
But does this mean I will have extra container instances?
Yes, during the deployment you will add some instances, and once the older tasks drain, the extra instances will hang around. The way to remove them is:
Create a MemoryReservationLow alarm (~70% threshold in your case) with a longer duration, such as 25 minutes, to be sure the capacity is genuinely over-provisioned. Once reservation drops because those extra servers are no longer in use, they can be removed.
I have seen this before. If your port mapping is attempting to map a static host port to the container within the task, you need more cluster instances.
Also this could be because there is not enough available memory to meet the memory (soft or hard) limit requested by the container within the task.

add node to existing aerospike cluster using autoscale

Can I run an Aerospike cluster under AWS Auto Scaling? For example: my initial Auto Scaling group size will be 3; if more traffic comes in and CPU utilization goes above 80%, it will add another instance to the cluster. Do you think this is possible? And does it have any disadvantages, or will it create any problems in the cluster?
There's an Amazon CloudFormation script at aerospike/aws-cloudformation that gives an example of how to launch such a cluster.
However, the point of autoscale is to grow shared-nothing worker nodes, such as webapps. These nodes typically don't have any shared data on them, you simply launch a new one and it's ready to work.
The point of adding a node to a distributed database like Aerospike is to have more data capacity, and to even out the data across more nodes, which gives you an increased ability to handle operations (reads, writes, etc). Autoscaling Aerospike would probably not work as you expect it. This is because of the fact that when a node is added to the cluster a new (larger) cluster is formed, and the data is automatically balanced. Part of balancing is migrating partitions of data between nodes, and it ends when the number of partitions across each node is even once again (and therefore the data is evenly spread across all the nodes of the cluster). Migrations are heavy, taking up network bandwidth.
This would work if you could time it to happen ahead of the traffic peak, because then migrations could complete ahead of time and your cluster would be ready for the next peak. You would not want to do this as peak traffic is occurring, because it would only make things worse. You also want to make sure that when the cluster contracts there is enough room for the data and enough DRAM for the primary index, as the per-node usage of both will grow.
One more point of having extra capacity in Aerospike is to allow for rolling upgrades, where one node goes through upgrade at a time without needing to take down the entire cluster. Aerospike is typically used for realtime applications that require no downtime. At a minimum your cluster needs to be able to handle a node going down and have enough capacity to pick up the slack.
Just as a note, you have fine-grained configuration control over the rate at which migrations happen, but they run longer if you make the process less aggressive.
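As a rough illustration of why adding a node triggers heavy migrations: Aerospike splits each namespace into 4096 partitions and rebalances them evenly across nodes, so growing from N to N+1 nodes hands roughly 1/(N+1) of the partitions (and thus of the data) to the new node. The helpers below are illustrative only; replication and in-flight writes make the real traffic higher:

```python
PARTITIONS = 4096  # fixed partition count per Aerospike namespace

def partitions_moved_on_join(old_node_count):
    """Approximate partitions handed to a node joining a cluster of
    `old_node_count` nodes, assuming an even rebalance."""
    return PARTITIONS // (old_node_count + 1)

def fraction_of_data_migrated(old_node_count):
    # Fraction of the cluster's data that streams to the new node.
    return 1.0 / (old_node_count + 1)
```

So on the questioner's 3-node group, each scale-out event streams about a quarter of the dataset across the network, which is why doing it mid-peak makes things worse.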

Is it possible to use Auto Scaling with Elastic MapReduce?

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance; if this instance reaches 50% utilization, I want the Auto Scaling group I created to start a new instance. Is this possible?
Or, because Elastic MapReduce is "elastic", does it automatically start more instances as needed, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful: nodes hold HDFS data and intermediate outputs, so deleting nodes based on CPU/memory just doesn't work. Adding nodes needs sophistication as well; this isn't a web site. One needs to look at the sizes of the jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
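The API/CLI resize described above comes down to an EMR ModifyInstanceGroups call. A hedged sketch of the request shape (the instance group ID below is a placeholder; look yours up via the console or a list-instance-groups call first):

```python
def resize_request(instance_group_id, target_count):
    """Payload shape for EMR's ModifyInstanceGroups API. With boto3 this
    would be sent as:
        boto3.client("emr").modify_instance_groups(
            InstanceGroups=request["InstanceGroups"])
    (not run here, since it needs AWS credentials and a live cluster)."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": target_count}
        ]
    }

# "ig-EXAMPLE" is a hypothetical ID for a *task* instance group; as the
# answer notes, core groups hosting HDFS cannot simply be shrunk this way.
request = resize_request("ig-EXAMPLE", 5)
```

Driving such calls from a scheduler, or from an EMR Step as described above, is how clusters were grown and shrunk before any native autoscaling existed.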
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?