What are all the reasons a node in a cluster can go into an unhealthy state?
Based on my limited understanding, this generally happens when the HDFS utilization on a given node goes beyond a threshold value. This threshold is defined by the max-disk-utilization-per-disk-percentage property.
I have observed that, at times, when a memory-intensive Spark job is triggered through spark-sql or pyspark, nodes go into an unhealthy state. On further investigation, I ssh'd into the node that was unhealthy and discovered that DFS utilization was actually less than 75%, while the value set for the above-mentioned property was 99 on my cluster.
So I presume there is some other factor that I am missing which causes this behavior.
Thanks in advance for your help.
Manish Mehra
The YARN NodeManager on each Hadoop (slave) node will mark the node unhealthy based on heuristics determined by the health checker. By default this is the disk checker; if configured, it can also be an external health-checker script.
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#Health_checker_service
The default disk checker checks the free disk space on the node, and if a disk goes over 90% usage it will mark the node unhealthy (90 is the default, set in yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage).
In your case, you seem to be checking HDFS usage, which spans across nodes. You need to verify the disk utilization on each individual node, using "df -h" to check the disk usage on that node. If you see a volume like /mnt/ going over the threshold (99% in your case), the node will be marked unhealthy.
You will need to find the top directories occupying the most disk space and take appropriate action. HDFS, which uses the disks on the nodes (set via dfs.data.dir), can cause nodes to go unhealthy if its utilization is very high during a job run. However, nodes can also go unhealthy without high HDFS utilization.
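In practice, non-HDFS data such as YARN container logs, localized files, and Spark spill directories under /mnt can also push a disk over the threshold even when HDFS usage is low, which matches what you observed. A quick way to check on the unhealthy node is sketched below (the mount point and config path are examples; adjust them for your layout):

    # Per-filesystem usage on this node (what the NodeManager disk checker looks at)
    df -h

    # Largest directories under a suspect mount, e.g. /mnt (example path)
    sudo du -xh --max-depth=2 /mnt | sort -h | tail -20

    # Confirm the configured threshold, if it is set explicitly (config path may differ)
    grep -A1 "max-disk-utilization-per-disk-percentage" /etc/hadoop/conf/yarn-site.xml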
Related
I'm evaluating Karpenter (https://karpenter.sh/) and I wanted to know if there's a way to vertically scale down a large node with few pods. The only scaling actions seem to be triggered by either unschedulable pods or empty nodes.
Scenario: I scheduled 5 pods and the scheduler gave me one c5d.2xlarge instance, which resulted in 65% utilization (not bad). I killed 3 pods and utilization dropped, as expected, to 25%. I waited to see if an optimization process would kick in, but nothing happened (over 20 hours). The feature is not well documented; in fact, the only reference to it is in this independent article: https://blog.sivamuthukumar.com/karpenter-scaling-nodes-seamlessly-in-aws-eks
How does it work?
Observes the pod resource requests of unscheduled pods
Direct provisioning of just-in-time capacity for the node (Groupless Node Autoscaling)
Terminating nodes if outdated
Reallocating the pods in nodes for better resource utilization
Am I missing something? Is there a way to do this, using Karpenter or another solution? TIA
So there's a feature request on Karpenter's GitHub project addressing this specific issue: https://github.com/aws/karpenter/issues/1091. I'll update this answer once a solution is available.
The workaround suggested by the project team was to set a short TTL on the nodes (e.g. 1 day), forcing Karpenter to re-evaluate the allocation daily.
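A minimal sketch of that workaround, assuming the v1alpha5 Provisioner API and a provisioner named "default" (newer Karpenter releases expose expiration differently, e.g. through NodePool disruption settings):

    # Expire nodes after 24 hours so Karpenter re-evaluates and re-packs workloads
    kubectl patch provisioner default --type merge \
      -p '{"spec":{"ttlSecondsUntilExpired":86400}}'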
In GKE, for cost savings, I usually scale the node count down to zero. When I scale the nodes back up (i.e. add them) and run the pods, it takes more than 6-7 minutes for the load balancer to connect and the URL to come up; until then the health checks stay in a waiting state. Is there any way to reduce this time? Thanks
If Cloud Functions is not an option, you might want to look at Cloud Run (which supports containers and scales to zero) or GKE Autopilot (which does not scale to zero, but you can scale down to minimal resources and it will autoscale up and down as needed).
In short, not really. Node spin-up time is not easily controlled; it is essentially the time it takes for the VM to be allocated, turned on, boot the OS, and complete the Kubernetes-related setup (configuration, joining the node pool, etc.), and that takes time. On top of that there is the Pods' spin-up time, which depends on the Docker image (size, dependencies, etc.).
Scaling your application down to zero nodes is not really recommended. It is always advisable to keep some nodes up (don’t you have other apps running on the GKE cluster? Kubernetes clusters are generally recommended to have at least 3 nodes running).
Have you considered using Cloud Functions? Is it possible in your case? This is the best option I know of for a quick scale-up and scale-to-zero.
And in general you can keep some kind of “ping” to the function to keep it “hot” for a relatively cheap price.
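A rough sketch of such a ping, run from any always-on machine or scheduler (the URL is a placeholder for your function's HTTPS trigger):

    # crontab entry: hit the function every 5 minutes to keep an instance warm
    */5 * * * * curl -fsS https://REGION-PROJECT.cloudfunctions.net/my-function > /dev/null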
If none of the options above is possible (I'd say keeping your node pool at a minimum of 3 nodes is best, as it takes time for the Kubernetes control plane to boot), I suggest starting by reducing the spin-up time of your Pods by improving the Docker image - reducing its size, etc.
Here are some articles on how to reduce Docker image size
https://phoenixnap.com/kb/docker-image-size
https://www.ardanlabs.com/blog/2020/02/docker-images-part1-reducing-image-size.html
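Before optimizing, it helps to see where the image size actually comes from; a generic sketch (my-app:latest is a placeholder for your image):

    # Show per-layer sizes to find the biggest contributors
    docker history my-app:latest

    # Compare total image sizes before and after a rebuild
    docker images my-app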
After that, I would experiment with different machine types for the nodes to check which one spins up the fastest - that could be an interesting thing to do in any case.
Here is an interesting comparison of VM spin-up times:
https://www.google.com/amp/s/blog.cloud66.com/part-2-comparing-the-speed-of-vm-creation-and-ssh-access-on-aws-digitalocean-linode-vexxhost-google-cloud-rackspace-packet-cloud-a-and-microsoft-azure/amp/
I have been trying to auto-scale a 3-node Cassandra cluster with Replication Factor 3 and Consistency Level 1 on Amazon EC2 instances.
What steps do I need to perform to add/remove nodes to the cluster dynamically, based on the load on the application?
Unfortunately, scaling up and down in response to the current load is not straightforward, and if you have a cluster with a large amount of data, it won't really be possible:
You can't add multiple nodes simultaneously to a cluster; all the operations need to be sequential.
Adding or removing a node requires streaming data into or out of the node; how long this takes will depend on the size of your data, as well as the EC2 instance type you are using (for the network bandwidth limit); there will also be differences depending on whether you are using instance storage or EBS (EBS will limit you in IOPS).
You mentioned that you are using AWS and a replication factor of 3; are you also using different availability zones (AZs)? If you are, the EC2Snitch will work to ensure that the information is balanced between them in order to be resilient, so when you are scaling up and down you will need to keep an even distribution between AZs.
The scale operations will cause a rearrangement in the distribution of tokens; once that is completed you will need to do a cleanup (nodetool cleanup) to remove data that is no longer owned by the node (see the sketch after this list); this operation will also take time. This is important to keep in mind if you are scaling up because you are running out of space.
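For reference, a minimal sketch of the per-node sequence when scaling up, assuming auto_bootstrap is enabled (the default) and nodetool is on the PATH:

    # On the new node, after cassandra.yaml is configured with the cluster's seeds:
    sudo service cassandra start      # bootstrap streams the node's share of data in

    # Wait until the new node shows as UN (Up/Normal):
    nodetool status

    # Then, on each of the previously existing nodes, one at a time:
    nodetool cleanup                  # drop data the node no longer owns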
For our use case, we are getting good results with a proactive approach: we have set up an aggressive alerting/monitoring strategy for early detection, so we can start the scale-up operations before there is any performance impact. If your application or use case has a predictable usage pattern, that can also help you take action in preparation for periods of high workload.
Now that EMR supports downsizing of core nodes, if I create an EMR cluster with one of the core nodes as a Spot Instance, what happens when the Spot price exceeds the bid price for that core node? Will it gracefully decommission that core node?
Here is Amazon's description of the process of shrinking the number of core nodes:
On core nodes, both YARN NodeManager and HDFS DataNode daemons must be decommissioned in order for the instance group to shrink. For YARN, graceful shrink ensures that a node marked for decommissioning is only transitioned to the DECOMMISSIONED state if there are no pending or incomplete containers or applications. The decommissioning finishes immediately if there are no running containers on the node at the beginning of decommissioning.
For HDFS, graceful shrink ensures that the target capacity of HDFS is large enough to fit all existing blocks. If the target capacity is not large enough, only a partial amount of core instances are decommissioned such that the remaining nodes can handle the current data residing in HDFS. You should ensure additional HDFS capacity to allow further decommissioning. You should also try to minimize write I/O before attempting to shrink instance groups, as that may delay the completion of the resize operation.
Another limit is the default replication factor, dfs.replication, inside /etc/hadoop/conf/hdfs-site. Amazon EMR configures the value based on the number of instances in the cluster: 1 with 1-3 instances, 2 for clusters with 4-9 instances, and 3 for clusters with 10+ instances. Graceful shrink does not allow you to shrink core nodes below the HDFS replication factor; this is to prevent HDFS from being unable to close files due to insufficient replicas. To circumvent this limit, you must lower the replication factor and restart the NameNode daemon.
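For reference, a hedged sketch of checking and lowering the replication factor on the master node (the exact NameNode restart command varies by EMR release):

    # Current default replication factor
    hdfs getconf -confKey dfs.replication

    # Lower replication of existing files so blocks fit on fewer core nodes
    hdfs dfs -setrep -w 2 /

    # After editing dfs.replication in /etc/hadoop/conf/hdfs-site.xml,
    # restart the NameNode (service name may differ on older releases)
    sudo systemctl restart hadoop-hdfs-namenode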
I think it might not be possible to gracefully decommission the node in the case of a Spot price spike (in the general case with N core nodes). There is a two-minute notification available before the Spot Instance is removed due to a price spike. Even if captured, this time period might not be sufficient to guarantee decommissioning of the HDFS data.
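If you want to at least react to that warning, the interruption notice can be polled from the instance metadata service; a sketch, assuming IMDSv2 (the endpoint returns 404 until an interruption is scheduled):

    # Returns an action/time JSON once the Spot Instance is marked for interruption
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
    curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/spot/instance-action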
Also, with only one core node in the cluster, decommissioning does not make much sense. The data held in the cluster needs to be moved to other nodes, which are not available in this case. Once the only available core node is lost, there needs to be a way to bring one back, or else the cluster cannot run any tasks.
Shameless plug :) : I work for Qubole!
The following two blog posts might be useful on integrating Spot Instances with Hadoop clusters, including dealing with Spot price spikes.
https://www.qubole.com/blog/product/riding-the-spotted-elephant
https://www.qubole.com/blog/product/rebalancing-hadoop-higher-spot-utilization
Can I add an Aerospike cluster under AWS autoscaling? For example, my initial autoscaling group size will be 3; if more traffic comes in and CPU utilization goes above 80%, it will add another instance to the cluster. Do you think this is possible? And does it have any disadvantages, or will it create any problems in the cluster?
There's an Amazon CloudFormation script at aerospike/aws-cloudformation that gives an example of how to launch such a cluster.
However, the point of autoscaling is to grow shared-nothing worker nodes, such as webapps. These nodes typically don't have any shared data on them; you simply launch a new one and it's ready to work.
The point of adding a node to a distributed database like Aerospike is to have more data capacity, and to even out the data across more nodes, which gives you an increased ability to handle operations (reads, writes, etc). Autoscaling Aerospike would probably not work as you expect. This is because when a node is added to the cluster a new (larger) cluster is formed, and the data is automatically rebalanced. Part of rebalancing is migrating partitions of data between nodes, and it ends when the number of partitions across each node is even once again (and therefore the data is evenly spread across all the nodes of the cluster). Migrations are heavy, taking up network bandwidth.
This would work if you could time it to happen ahead of the traffic peaking, because then migrations could be completed ahead of time, and your cluster would be ready for the next peak. You would not want to do this as peak traffic is occurring, because it would only make things worse. You also want to make sure that when the cluster contracts there is enough room for the data and enough DRAM for the primary index, as the per-node usage of both will grow.
One more reason to have extra capacity in Aerospike is to allow for rolling upgrades, where one node at a time goes through the upgrade without needing to take down the entire cluster. Aerospike is typically used for realtime applications that require no downtime. At a minimum, your cluster needs to be able to handle a node going down and have enough capacity to pick up the slack.
Just as a note, you have fine-grained configuration control over the rate at which migrations happen, but migrations run longer if you make the process less aggressive.
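For example, the migration rate can be tuned at runtime; a sketch, noting that parameter names and defaults vary by Aerospike server version:

    # Reduce the number of migration threads to lower the impact on live traffic
    asinfo -v "set-config:context=service;migrate-threads=1"

    # Watch remaining partition migrations while the cluster rebalances
    asadm -e "show statistics like migrate"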