Autoscaling a running Hadoop cluster setup on AWS EC2

My goal is to understand how I can auto-scale a Hadoop cluster on AWS EC2.
I am exploring AWS offerings from an elastic-scaling perspective, both for Hadoop as a service (EMR) and for Hadoop on EC2.
For EMR, I gathered that CloudWatch can monitor performance metrics and alert the user once a set threshold is reached; thereafter the cluster can be scaled up or down depending on its utilization state.
This approach would require some custom implementation to automate the steps (correct me if I am missing anything here).
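For illustration, a minimal sketch of what that custom implementation could look like with boto3, polling one of the EMR CloudWatch metrics and resizing the core instance group. The cluster ID, instance group ID, and the 15% threshold are hypothetical placeholders, not anything EMR prescribes.

from datetime import datetime, timedelta
import boto3

emr = boto3.client("emr")
cloudwatch = boto3.client("cloudwatch")

CLUSTER_ID = "j-XXXXXXXXXXXX"      # hypothetical EMR cluster ID
CORE_GROUP_ID = "ig-XXXXXXXXXXXX"  # hypothetical core instance group ID

# Average available YARN memory over the last ten minutes.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",
    Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
    StartTime=datetime.utcnow() - timedelta(minutes=10),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

datapoints = stats["Datapoints"]
if datapoints and min(dp["Average"] for dp in datapoints) < 15:
    # Memory is running low: grow the core group by one node. EMR
    # bootstraps the new node into the cluster itself.
    groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
    core = next(g for g in groups if g["Id"] == CORE_GROUP_ID)
    emr.modify_instance_groups(InstanceGroups=[{
        "InstanceGroupId": CORE_GROUP_ID,
        "InstanceCount": core["RunningInstanceCount"] + 1,
    }])

Run on a schedule (for example from a small cron job or Lambda function), something like this closes the monitor-alert-resize loop described above.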
For Hadoop on EC2, I came across the Auto Scaling option, which can add or remove instances according to configured scaling policies.
But I am not clear on how a newly added node would get bootstrapped into the cluster automatically. How would YARN know that it can spawn a new container on this newly added node?
Does auto scaling work for a master/worker kind of setup as well, or is it limited to web applications?
There is also Qubole, which offers services to manage Hadoop on AWS; should that be used for automatically managing the scaling of the cluster?
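As a hedged sketch of the EC2 Auto Scaling side (the ASG name, threshold, and policy below are hypothetical): the scaling policy only launches the instance. Making YARN aware of it is up to the instance's own bootstrap, typically launch-configuration user data that writes the ResourceManager address into yarn-site.xml and starts the NodeManager, which then registers itself with the ResourceManager.

import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "hadoop-workers"  # hypothetical ASG holding the worker nodes

# Add one instance whenever the alarm below fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Alarm on sustained high average CPU across the group.
cloudwatch.put_metric_alarm(
    AlarmName="hadoop-workers-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)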

Related

Deploying to bare EC2 instances in an ASG?

I have a service that needs to run on our own EC2 instances, since it requires some support from the kernel. My previous experience is all with containers in AWS. The application itself is distributed as a single JAR file, and I'm looking for advice on how to automate deployments. The architecture is:
An ALB in front of the ASG.
EC2 instance running a single Java application.
Any open sockets are open for an hour at most, and to not cause any trouble we have to drain the connections to the EC2 instances before performing an update, so a hard requirement is for the ALB to stop opening new connections for an hour before updating the software. The application is mission critical, and ECS had some issues last year, so I want to minimize the AWS services I depend on. While I could do what I want on my own ECS cluster with custom AMIs, I don't want to, since I will run a single instance of the app per host and don't need the extra layer.
My question: What is the simplest method to achieve this using CodePipeline? My understanding is that I need to use a CodeDeploy deployment step to push something to bare EC2 instances. How does draining with an ALB work in this case? We're using CloudFormation for the deployment.
You need to use CodeDeploy. You can find a tutorial in the AWS CodeDeploy documentation.
See the CodeDeploy deployment lifecycle hooks for EC2:
https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file-structure-hooks.html#appspec-hooks-server
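On the draining question: when the deployment group is integrated with a load balancer, CodeDeploy deregisters each instance from the ALB target group before updating it, and the ALB then stops opening new connections and waits out the target group's deregistration delay. Your one-hour requirement maps onto that delay, whose maximum happens to be 3600 seconds. A hedged boto3 sketch (the target group ARN is a placeholder):

import boto3

elbv2 = boto3.client("elbv2")

# Wait up to one hour for in-flight connections after deregistration;
# 3600 seconds is the maximum the ALB allows.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:account:targetgroup/app/id",
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "3600"},
    ],
)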

Monitor an EKS cluster using AppDynamics

I have an Elastic Kubernetes Service (EKS) cluster running in AWS, and many services and pods are running in it. I want to use AppDynamics to monitor the services and pods. I am new to AppDynamics, so I don't know much about it, but I am confused in some areas:
Which performance metrics (CPU usage, number of instances, ...) should I use to monitor the cluster?
How can I monitor the cluster, and how do I set up AWS with AppDynamics to monitor everything?
The Cluster Agent is used for monitoring AWS EKS; additionally, the Cluster Agent Operator can be used to set up additional infra/network monitoring.
Compatibility: https://docs.appdynamics.com/21.9/en/infrastructure-visibility/monitor-kubernetes-with-the-cluster-agent/cluster-agent-requirements-and-supported-environments
Install (Cluster Agent): https://docs.appdynamics.com/21.9/en/infrastructure-visibility/monitor-kubernetes-with-the-cluster-agent/install-the-cluster-agent
(You will need to grab / build an image and then install using Kubernetes CLI or the Cluster Agent Helm Chart)
Install (Infra Agent / Network Visibility - requires the Cluster Agent): https://docs.appdynamics.com/21.9/en/infrastructure-visibility/monitor-kubernetes-with-the-cluster-agent/install-the-cluster-agent/install-infrastructure-visibility-with-the-cluster-agent-operator
Metrics: https://docs.appdynamics.com/21.9/en/infrastructure-visibility/monitor-kubernetes-with-the-cluster-agent/use-the-cluster-agent/monitor-cluster-health
As to which metrics to actively monitor: this is a bit subjective, but there are plenty of guides around to help, e.g.:
https://www.kubermatic.com/blog/the-complete-guide-to-kubernetes-metrics/
https://sematext.com/blog/kubernetes-metrics/
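Independent of AppDynamics, a quick way to get a feel for the raw signals those guides discuss (pod phases, container restart counts) is the official Kubernetes Python client. This is just an illustrative sketch, not AppDynamics tooling:

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. one created by
# `aws eks update-kubeconfig`).
config.load_kube_config()
v1 = client.CoreV1Api()

# Pod phase and restart counts are among the first things most guides
# suggest watching for cluster health.
for pod in v1.list_pod_for_all_namespaces().items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase, restarts)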

How to set up autoscaling for ECS cluster that uses scheduled tasks and no service?

I have an ECS cluster with an EC2 instance tied to it, and a scheduled task set to run daily using the 'Scheduled Tasks' functionality on the ECS dashboard.
This task runs a bunch of containers that are each relatively expensive in memory, which is compounded even more when all the containers run at once.
I do not currently have a service set up for the ECS cluster, and it is my understanding that for my goal, running a set task on some interval, a service would not be used.
AWS's definition of a service in their ECS docs says:
An Amazon ECS service enables you to run and maintain a specified number of instances of a task definition simultaneously in an Amazon ECS cluster.
Since this is not what I want, and instead I just need to run a task on some scheduled interval, I gather I do not need a service tied to my ECS cluster.
My question is how to set up autoscaling for my scheduled tasks. The only references I can find to auto scaling within an ECS cluster have to do with creating ECS services that auto-scale, which again is not what I want (at least from how I understand ECS services to work).
What I need is for my EC2 instances to auto-scale along with my scheduled task, allocating more resources as needed for the task to run. Would I just need to set up auto scaling on the specific EC2 instance the ECS cluster is tied to from within the EC2 dashboard, or is there some other way to do this from ECS directly?
For the above use case it is better to use Fargate; then you will not have to maintain or worry about auto-scaling and scheduling. All you need to do is set up the scheduled task, and AWS will take care of the memory and other resources required by it. You also only pay for the resources your ECS task actually used, unlike an EC2-type task, where you pay for the container instance.
AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.
aws fargate
Create a CloudWatch Events rule based on some schedule that will trigger the task. Make sure that the container exits once it completes its job; Fargate will automatically stop the task.
cloudwatch-event-rule-to-invoke-an-ecs-task
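Putting those pieces together, a hedged boto3 sketch of such a rule and target (the task definition, subnet, cluster, and IAM role ARNs are placeholders you would substitute):

import boto3

events = boto3.client("events")

# Fire once a day.
events.put_rule(
    Name="daily-batch-task",
    ScheduleExpression="rate(1 day)",
)

# Point the rule at a Fargate task. All ARNs and the subnet are
# placeholders; the role must allow events.amazonaws.com to run the task.
events.put_targets(
    Rule="daily-batch-task",
    Targets=[{
        "Id": "run-batch",
        "Arn": "arn:aws:ecs:region:account:cluster/my-cluster",
        "RoleArn": "arn:aws:iam::account:role/ecsEventsRole",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:region:account:task-definition/batch:1",
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "ENABLED",
                },
            },
        },
    }],
)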

How to understand Amazon ECS cluster

I recently tried to deploy Docker containers using task definitions on AWS. Along the way, I came across the following questions.
How do I add an instance to a cluster? When creating a new cluster using the Amazon ECS console, how do I add a new EC2 instance to it? In other words, when launching a new EC2 instance, what configuration is needed in order to allocate it to a user-created cluster under Amazon ECS?
How many ECS instances are needed in a cluster, and what are the factors?
If I have two instances (ins1, ins2) in a cluster, and my webapp and db containers are running on ins1: after I update the running service (through http://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html), I can see the newly created service running on ins2 before the old service on ins1 is drained. My question is that after my webapp container is allocated to another instance, the access IP address becomes the other instance's IP. How do I prevent this, or what is the solution for keeping the same IP address for accessing the webapp? And not only the IP: what about the data after changing to a new instance?
These are really three fairly different questions, so it might be best to split them into different questions here accordingly. I'll try to provide an answer regardless:
1. Amazon ECS container instances are added indirectly; it's the job of the Amazon ECS container agent on each instance to register itself with the cluster created and named by you (see concepts and lifecycle for details). For this to work, you need to follow the steps outlined in Launching an Amazon ECS Container Instance, be it manually or via automation. Be aware of step 10:
By default, your container instance launches into your default cluster. If you want to launch into your own cluster instead of the default, choose the Advanced Details list and paste the following script into the User data field, replacing your_cluster_name with the name of your cluster.
#!/bin/bash
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config
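To confirm that a freshly launched instance has actually registered with the right cluster, a quick hedged check with boto3 (assuming the cluster name from the user data above):

import boto3

ecs = boto3.client("ecs")

# Each registered container instance shows up here once its ECS agent
# has phoned home; an empty list means registration hasn't happened yet.
response = ecs.list_container_instances(cluster="your_cluster_name")
print(response["containerInstanceArns"])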
2. You only need a single instance for ECS to work as such, because the cluster itself is managed by AWS on your behalf. This wouldn't be sufficient for high-availability scenarios though:
Because the container hosts are just regular Amazon EC2 instances, you would need to follow AWS best practices and spread them over two or three Availability Zones (AZ) so that a (rare) outage of an AZ doesn't impact your cluster, because ECS can migrate your containers to a different host instance (provided your cluster has sufficient spare capacity).
Many advanced clustering technologies that facilitate containers have their own service orchestration layers and usually require an uneven number >= 3 (service) instances for a high availability setup. You can read more about this in section Optimal Cluster Size within Administration for example (see also Running CoreOS with AWS EC2 Container Service).
3. This refers back to the high-availability and service-orchestration topics mentioned in 2. already; more precisely, you are facing the problem of service discovery, which becomes even more prevalent when using container technologies in general and micro-services in particular:
To get familiar with this, I recommend Jeff Lindsay's Understanding Modern Service Discovery with Docker for an excellent overview specifically focused on your use case.
Jeff also maintains a containerized version of the increasingly popular Consul, which makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface (see Running Consul in Docker and gliderlabs/docker-consul).

Storm on Mesos in AWS

I am running a Storm cluster on AWS, but I want the cluster to expand automatically when the need arises. I figured Mesos is something like that, but I do not have much knowledge about Mesos and its deployment on AWS.
Can Mesos on AWS automatically increase the parallelism of my topology tasks by launching new instances and shutting them down when they are not necessary? If it can, how do we configure Mesos to do so?
Mesos does not directly handle autoscaling itself, but allows frameworks running on top of it to receive new resource offers and react to them by launching new task instances. I haven't used it personally, but you could try the Storm-Mesos framework for running Storm on Mesos: https://github.com/mesos/storm
Once you have Storm running on Mesos, ready to launch new instances as resources become available, you're ready to autoscale within the existing cluster's capacity. You'll probably want to take advantage of Amazon's Auto Scaling groups (ASGs) to scale up the number of Mesos nodes based on your need. As the ASG adds more Mesos nodes, the resources from those nodes will be automatically offered to the Storm-Mesos framework, which can launch more Storm instances.
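For illustration, the ASG side could be as simple as the following boto3 sketch (the group name "mesos-agents" and the target capacity are hypothetical):

import boto3

autoscaling = boto3.client("autoscaling")

# Grow the Mesos agent pool; once the new nodes join the Mesos cluster,
# their resources are offered to frameworks such as Storm-Mesos, with no
# extra wiring needed on the Storm side.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="mesos-agents",
    DesiredCapacity=5,
    HonorCooldown=True,
)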
Yes, you're heading in the right direction. However, I'd suggest using Marathon rather than the low-level Mesos API.
See for example the GitHub repo obaidsalikeen/storm-marathon, which is particularly well done in terms of completeness and documentation richness.