Can anyone explain the usage of this command in detail?
hdfs balancer [-source [-f <hosts-file> | <comma-separated list of hosts>]]
The issue in my cluster is that one of my datanodes is using 93% of its block pool capacity, while the rest are using less than 80%.
From the Hortonworks documentation:
The new -source option allows specifying a source datanode list so that the Balancer selects blocks to move from only those datanodes. When the list is empty, all the datanodes can be chosen as a source. The default value is an empty list.
The option can be used to free up the space of some particular datanodes in the cluster. Without the -source option, the Balancer can be inefficient in some cases. Below is an example.
Datanodes (with the same capacity)   Utilization   Rack
D1                                   95%           A
D2                                   30%           B
D3, D4, D5                           0%            B
In the table above, the average utilization is 25% so that D2 is within the 10% threshold. It is unnecessary to move any blocks from or to D2. Without specifying the source nodes, Balancer first moves blocks from D2 to D3, D4 and D5, since they are under the same rack, and then moves blocks from D1 to D3, D4 and D5. By specifying D1 as the source node, Balancer directly moves blocks from D1 to D3, D4 and D5.
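Given that, a minimal sketch for your case would be to point the Balancer only at the over-utilised datanode (the hostname below is a placeholder, and the 10% threshold is just the usual default, so adjust both for your cluster):

# run as the hdfs user; only blocks on the listed host will be selected as sources
hdfs balancer -threshold 10 -source dn-93.example.internal
# or, with a file containing one source hostname per line:
hdfs balancer -threshold 10 -source -f /tmp/overloaded-datanodes.txt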
Here are my cluster details:
Master : Running 1 m4.xlarge
Core : Running 3 m4.xlarge
Task : --
Cluster scaling: Not enabled
I am using notebooks to practice PySpark, and I would like to know how the resources are being utilised, to assess whether they are under-utilised or not enough for my tasks. As part of that, when checking RAM/memory usage, here's what I got from the terminal:
notebook#ip-xxx-xxx-xxx-xxx ~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9G        456M        759M         72K        741M        1.4G
Swap:            0B          0B          0B
Each m4.xlarge instance comes with 16 GB of memory. What's happening, and why are only two gigs of the 16 GB being shown? And how do I properly learn how much of my CPU, memory and storage are actually being used? (yes, to reduce costs!!)
If you want to check memory and CPU utilization, you can check that in CloudWatch with the instance ID.
To get the instance ID of the node, go to Hardware -> Instance Group -> Instances in the EMR console.
You can get detailed metrics of CPU, memory and IO for each node.
Another option is to use the ResourceManager UI (YARN). The default URL is http://master-node-ip:8088.
You can get metrics on job level as well as node level.
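If you prefer the command line over the UI, roughly the same numbers are available from the YARN CLI and its standard REST API (run these on the master node; the host/port below are just the defaults, nothing specific to your cluster):

yarn node -list -all                                 # NodeManagers with containers, memory and vcores in use
curl -s http://localhost:8088/ws/v1/cluster/metrics  # cluster-wide allocated/available memory and vcores, as JSON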
To start reducing costs you can use m5 or m6g instances and also consider using spot instances. Also, you can use the metrics in the EMR console, in the Monitoring tab. The container-pending graph is a good way to start monitoring. If you don't have any containers pending and you have executors without tasks running inside, you are wasting resources (CPU).
Currently, I don't have an AWS Load Balancer set up yet.
Requests come to a single EC2 instance: they first hit Nginx, which then forwards them to Node/Express.
Now, I want to create an autoscaling group, and attach AWS load balancer to distribute the request that comes in. I am wondering if this is a good setup:
Request -> AWS Load Balancer -> Nginx A + EC2 A
-> Nginx B + EC2 B
-> ... C + ... C
Nginx is installed on the same EC2 that has node.js running on it. Nginx config has logic to detect user's location using the geoip module, as well as gzip compression configs and ssl handling.
I will also move the ssl handling to the load balancer.
Ideally (if Nginx can be decoupled from specific Node tasks) you'd want an auto scaling group dedicated to each service, and I'd suggest using containerization for this because that is exactly what it's meant for, though all of this will obviously require some non-trivial changes to your program...
This will enable...
Efficient Resource Allocation
Select instance types with the ideal mix of CPU/RAM/Network/Storage per service (Node or Nginx)
Maintain granular control over the number of tasks running relative to their actual demand.
Intelligent Scaling
Thresholds set to trigger scaling actions need to reflect the resources they're running on. You may not want to, say, double your more compute-intensive Node capacity when there are spikes in simple read operations to your program. By segmenting the services by resource, the thresholds can be tied to the resource your service demands the most of. You may want to scale (a sketch of what the policies could look like follows the list)...
Nginx based on maximum inbound requests over a 1-minute period
Node based on average CPU utilization over a 5-minute period
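As a rough sketch with the AWS CLI, target-tracking policies are one way to express those two rules (the group names, target values and ALB/target-group resource label below are placeholders; the request-count rule could equally be a step-scaling policy on a max-over-1-minute alarm):

# Nginx group: keep average requests per target near a chosen value
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name nginx-asg \
  --policy-name nginx-requests-per-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/nginx-tg/abcdef1234567890"
    },
    "TargetValue": 1000.0
  }'

# Node group: keep average CPU utilization near 50%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name node-asg \
  --policy-name node-avg-cpu \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
    "TargetValue": 50.0
  }'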
The 'chunks' that your instances are broken into, relative to the size of your tasks, also make a big difference in how efficiently they will scale. Here's an exaggerated example, on just one service...
1 EC2 t3.large running 5 Node tasks at 50% RAM utilization.
AS group hits 70% or whatever threshold you assigned, scales out 1 instance
2 instances now running, say, 6 Node tasks at 30% RAM utilization
This causes 2 problems...
You're now wasting a lot of money
Possibly more importantly... what is your scale-in threshold? 20% utilization?
The tighter the gap between your upper and lower scaling bounds, the more efficient you'll be. When the tasks you're running are all homogeneous, you can add and remove smaller and more precise 'chunks' of resources.
In the scenario above you'd ideally want something like...
3 t3.small instances running 5 Node tasks
AS group hits 70% Utilization, scales-out 1 instance
Now you have 6 tasks on 4 instances at 50% utilization
Utilization drops to 40%, scale in 1 instance.
You can obviously still do all of this running Node and Nginx on the same underlying resources, but the mathematics of it all gets pretty crazy and makes your system brittle.
(I've simplified the above to target memory utilization on the AS group, but in practice you'd have ECS adding tasks based on utilization, which then adds to the memory reservation of the cluster, which would then initiate the AS actions.)
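A rough sketch of the last link in that chain, with placeholder names (the ECS cluster name, threshold and policy ARN are all hypothetical): a CloudWatch alarm on the cluster's MemoryReservation metric triggering the EC2 Auto Scaling group's scale-out policy.

aws cloudwatch put-metric-alarm \
  --alarm-name node-cluster-memory-reservation-high \
  --namespace AWS/ECS --metric-name MemoryReservation \
  --dimensions Name=ClusterName,Value=node-cluster \
  --statistic Average --period 60 --evaluation-periods 3 \
  --threshold 70 --comparison-operator GreaterThanThreshold \
  --alarm-actions "$SCALE_OUT_POLICY_ARN"   # ARN of the scale-out policy on the AS group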
Simplified & Efficient Deployment
You don't want to be redeploying your whole Node code base for every update to your Nginx configuration.
Simplifies Blue/Green deployments, testing and rollbacks.
Minimize the resources you have to spin up for the 'Blue' portion of your deployments.
Use customized AMIs with pre-installed binaries, if needed, for only the service dependent on them.
Whether you want to do it immediately or not (and you will) this configuration will allow you to move to Spot Instances to handle more of your variable workloads. Like all of this, you can still use Spot Instances with the configuration you've laid out, but handling termination procedures efficiently and without disruptions is a whole other mess and when you get to that you want the rest of this very organized and working smoothly.
The relevant AWS pieces here: ECS, NLB.
I don't know what you're using for deployment, but AWS CodeDeploy will work beautifully with ECS to manage your container clusters as well.
I have been trying to auto-scale a 3-node Cassandra cluster with replication factor 3 and consistency level 1 on Amazon EC2 instances. Despite the load balancer, one of the autoscaled nodes has zero CPU utilization while the other autoscaled node has considerable traffic on it.
I have experimented more than 4 times with auto-scaling a 3-node cluster with RF 3 and CL 1, and the CPU utilization on one of the autoscaled nodes is still zero. The overall CPU utilization drops, but one of the autoscaled nodes is consistently idle from the point of auto-scaling.
Note that the two nodes which are launched at the point of autoscaling are started by the same launch configuration. The two nodes have the same configuration in every aspect. There is an alarm for the triggering of the nodes and the scaling policy is set as per that alarm.
Can there be a bash script that is run via the user data?
For example, altering the keyspaces?
Can someone let me know what could be the reason behind this behavior?
AWS auto scaling and load balancing is not a good fit for Cassandra. Cassandra has its own built in clustering with seed nodes to discover the other members of the cluster, so there is no need for an ELB. And auto scaling can screw you up because the data has to be re-balanced between the nodes.
https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
Yes, you don't need an ELB for Cassandra.
So you created a single-node Cassandra cluster and created some keyspace. Then you scaled Cassandra to three nodes. You found one new node was idle when accessing the existing keyspace. Is this understanding correct? Did you alter the existing keyspace's replication factor to 3? If not, the existing keyspace's data will still have only 1 replica.
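If that's the case, a minimal sketch of the fix (the keyspace name is a placeholder, and this assumes SimpleStrategy; use NetworkTopologyStrategy if you care about racks/datacenters):

cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
nodetool repair my_keyspace   # stream the existing data onto the new replicas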
When adding the new nodes, Cassandra will automatically balance some tokens onto the new nodes. This is probably why you are seeing load on one of the new nodes, which happens to get some tokens that have keyspace data.
What are all the reasons due to which a node in a cluster goes into an unhealthy state?
Based on my limited understanding, it generally happens when the HDFS utilization on the given node goes beyond a threshold value. This threshold value is defined with the max-disk-utilization-per-disk-percentage property.
I have observed that at times, when a memory-intensive Spark job is triggered through spark-sql or using pyspark, nodes go into an unhealthy state. Upon further looking, I did ssh into the node that was in the unhealthy state and discovered that the dfs utilization was actually less than 75%, and the value set for the above-mentioned property was 99 on my cluster.
So I presume there is some other factor that I am missing which basically causes this behavior.
Thanks in advance for your help.
Manish Mehra
The YARN NodeManager on each Hadoop node (worker) will mark the node unhealthy based on heuristics determined by the health checker. By default this is the disk checker. If set, it can also be an external health checker.
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html#Health_checker_service
The default disk checker checks the free disk space on the node, and if the disk(s) go over 90% utilization it will mark the node unhealthy (90% is the default, set in yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage).
In your case, you seem to be checking HDFS usage, which spans across nodes. You need to verify the disk utilization on individual nodes using "df -h" to check the disk usage on that node. If you see a volume like /mnt/ going over 99%, then the node will be marked unhealthy.
You will need to find the top directories occupying the most disk space and take appropriate action. HDFS, which uses the disk(s) on the nodes (set using dfs.data.dir), can cause the nodes to go unhealthy if its utilization is very high during a job run. However, the nodes can also go unhealthy without high HDFS utilization.
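As a quick sketch of those checks on a single node (the /mnt path is just a common EMR layout; adjust to wherever your YARN local/log dirs and DataNode dirs actually live):

df -h                                              # per-volume utilization on this node
sudo du -xsh /mnt/* 2>/dev/null | sort -h | tail   # largest directories under /mnt
hdfs dfsadmin -report | head -40                   # HDFS-level usage per datanode, for comparison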
I have over a hundred servers that are sending metrics to my statsd-graphite setup. A leaf out of the metric subtree is something like:
stats.dev.medusa.ip-10-0-30-61.jaguar.v4.outbox.get
stats.dev.medusa.ip-10-0-30-62.jaguar.v4.outbox.get
Most of my crawlers are AWS spot instances, which means that 20 or so of them go down and come back up at random, being allocated different IP addresses every time. This means that the same list becomes:
stats.dev.medusa.ip-10-0-30-6.<|subtree
stats.dev.medusa.ip-10-0-30-1.<|subtree
stats.dev.medusa.ip-10-0-30-26.<|subtree
stats.dev.medusa.ip-10-0-30-21.<|subtree
Assuming the metrics under each such subtree take up 4G in total, 20 spot instances going down and 30 of them later spawning up with different IP addresses means that my storage suddenly puffs up by 120G. Moreover, this is a weekly occurrence.
While it is simple and straightforward to delete the older IP subtrees, I really want to retain the metrics. I can have 3 medusas at week 0, 23 medusas at week 1, 15 in week 2, 40 in week 4. What are my options? How would you tackle this?
We achieve this by not logging the IP address. Use a deterministic locking concept: when instances come up they request a machine ID. They can then use this machine ID instead of the IP address for the statsd bucket.
stats.dev.medusa.machine-1.<|subtree
stats.dev.medusa.machine-2.<|subtree
This will mean you should only have up to 40 of these buckets. We are using this concept successfully, with a simple number-allocator API on a separate machine that allocates the instance numbers. Once a machine has an instance number it stores it as a tag on that machine, so our allocator can query the tags of the EC2 instances to see what is being used at the moment. This allows it to re-allocate old machine IDs.
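As a rough sketch of the idea (the machine-id tag name is made up, this uses the IMDSv1 metadata endpoint, and a real allocator should serialize the "pick the next free number" step behind a lock or a small API so two instances booting at once can't claim the same number):

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# collect machine ids already claimed by tagged instances
USED=$(aws ec2 describe-instances \
        --filters "Name=tag-key,Values=machine-id" \
        --query 'Reservations[].Instances[].Tags[?Key==`machine-id`].Value[]' \
        --output text)
# take the lowest unused number and tag this instance with it
NEXT=1; while echo "$USED" | grep -qw "$NEXT"; do NEXT=$((NEXT+1)); done
aws ec2 create-tags --resources "$INSTANCE_ID" --tags Key=machine-id,Value="$NEXT"
# then report to statsd as stats.dev.medusa.machine-$NEXT.* instead of the IP-based bucket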