Mesos - dynamic cluster size - amazon-web-services

Is it possible to have a dynamic cluster size in Mesos, with total cluster CPU and RAM quotas set?
The idea: Mesos knows my AWS credentials and spawns new EC2 instances only when a new job cannot fit into the existing resources (AWS or any other cloud provider). Similarly, when the job is finished it could kill the EC2 instance.
It could be a Mesos plugin/framework or some external tool - any help appreciated.
Thanks

What we do is use the Mesos monitoring tools and HTTP endpoints (http://mesos.apache.org/documentation/latest/endpoints/) to monitor the cluster.
We have our own framework that gathers all the relevant information from the master and slave nodes, and our algorithm uses that information to scale the cluster.
For example, if the cluster CPU utilization is > 0.90 we bring up a new instance and register that slave with the master.
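For concreteness, the core of such a loop might look like the sketch below. This is a minimal illustration, not our actual framework: the master URL, AMI, instance type, and threshold are placeholders, and the agent AMI is assumed to start mesos-agent pointed at the master on boot.

```python
import time

import boto3
import requests

MASTER_URL = "http://mesos-master:5050"  # placeholder master address
ec2 = boto3.client("ec2")

def cluster_cpu_utilization():
    # The master's /metrics/snapshot endpoint reports allocated CPU
    # share as master/cpus_percent (a 0..1 fraction).
    snapshot = requests.get(f"{MASTER_URL}/metrics/snapshot").json()
    return snapshot["master/cpus_percent"]

while True:
    if cluster_cpu_utilization() > 0.90:
        # Placeholder AMI, assumed to run mesos-agent on boot so the
        # new slave registers itself with the master.
        ec2.run_instances(ImageId="ami-12345678", InstanceType="m5.large",
                          MinCount=1, MaxCount=1)
    time.sleep(60)
```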

If I understand you correctly, you are looking for a solution to autoscale your Mesos cluster?
What some people do on AWS, for example, is create an Auto Scaling group, allowing them to scale the number of agent/slave nodes up and down depending on their needs.
Note that the triggers for when to scale up/down are usually application-dependent (e.g., it could be OK for one app to run at 100% utilization, while for another 80% should already trigger a scale-up action).
For an example of using AWS Auto Scaling groups you could have a look at the Mesosphere DCOS Community Edition (note that, as mentioned above, you will still have to write the trigger code for scaling your scaling group).
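As a minimal sketch of what such trigger code might do once your application-specific condition fires, here is one way to bump the group's desired capacity with boto3 (the group name is a placeholder):

```python
import boto3

asg = boto3.client("autoscaling")
ASG_NAME = "mesos-agents"  # hypothetical Auto Scaling group name

def scale_up(increment=1):
    # Read the group's current state, then raise the desired capacity,
    # capped at the group's configured maximum.
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
    desired = min(group["DesiredCapacity"] + increment, group["MaxSize"])
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME,
                             DesiredCapacity=desired)
```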

AFAIK, Mesos cannot autoscale by itself; something has to start Mesos agents for the cluster. One option is to build a script, managed by Marathon, that starts/stops agents after comparing the pending tasks in your framework with the resources in the Mesos cluster.
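A rough sketch of such a script, assuming Marathon's /v2/queue endpoint is used to detect apps waiting for offers (the Marathon URL and AMI are placeholders, and the agent AMI is assumed to join the cluster on boot):

```python
import time

import boto3
import requests

MARATHON_URL = "http://marathon:8080"  # placeholder
ec2 = boto3.client("ec2")

while True:
    # /v2/queue lists app instances Marathon wants to launch but has
    # not yet received matching resource offers for.
    queue = requests.get(f"{MARATHON_URL}/v2/queue").json().get("queue", [])
    if queue:
        # Tasks are waiting for resources, so add an agent.
        ec2.run_instances(ImageId="ami-12345678", InstanceType="m5.large",
                          MinCount=1, MaxCount=1)
    time.sleep(120)
```

The scale-down half (draining and terminating idle agents) is left out here; it needs care to avoid killing agents with running tasks.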

Related

How to Autoscale a GCP Managed Instance Group using a Rabbitmq VM outside the group

I am using GCP and I have a specific problem to solve: I want to use the metrics from a RabbitMQ instance to control the autoscaling of a Managed Instance Group. Note that this RabbitMQ instance is outside the group and is used only to maintain the messages in the queue.
I want to scale up the number of instances in the group when the number of current messages in the queue exceeds the number of available consumers. I had implemented the same in AWS, using Amazon MQ integrated with RabbitMQ to autoscale an ECS cluster of instances.
I have installed an Ops Agent on the RabbitMQ instance so that I can monitor the queue-based stats in a dashboard, but I am not sure how these metrics can be used to scale the group, as nothing on the MIG config page seems to point to these metrics being accessible.
My question is: is it possible to scale the instances in a MIG using the metrics of an external instance, as in my case? I ask because the GCP documentation seems to say that autoscaling can only use metrics from the instances within the group.
If not, I would like to understand other ways I could implement the same, perhaps by monitoring a consumer-based metric.
Custom metrics can be used to trigger the autoscaling feature. This document clearly outlines how to configure custom metrics to trigger autoscaling of MIG instances. The configuration involves three simple steps:
1. Create a custom metric for the RabbitMQ queue in Cloud Monitoring (see the sketch below).
2. Create a service account and grant it sufficient permissions to perform autoscaling actions.
3. Create a trigger using these custom metrics for scaling the managed instance group.
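To illustrate step 1, here is a minimal sketch of publishing the queue depth as a custom metric with the google-cloud-monitoring client library; the project ID, metric type, and the way you obtain the depth are all placeholders:

```python
import time

from google.cloud import monitoring_v3

PROJECT_NAME = "projects/my-project"  # placeholder project
client = monitoring_v3.MetricServiceClient()

def write_queue_depth(depth: int):
    # Publish one gauge point under a custom metric type; the MIG
    # autoscaler can then be configured to target this metric.
    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/rabbitmq/queue_depth"
    series.resource.type = "global"
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"int64_value": depth}})
    series.points = [point]
    client.create_time_series(name=PROJECT_NAME, time_series=[series])
```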

How can I understand `Nodes` in EKS Fargate?

I deployed an EKS cluster and a Fargate profile, then deployed a few applications to this cluster. I can see that the Fargate instances are launched.
When I click each of these instances, it shows me some information like OS, image, etc., but it doesn't tell me the CPU and memory. Looking at the Fargate pricing (https://aws.amazon.com/fargate/pricing/), it is calculated based on CPU and memory.
I have used ECS, where it is very clear that I need to provision CPU/memory at the service/task level, but I can't find anything equivalent in EKS.
How do I know how much CPU and memory they are consuming?
With Fargate you don't have to provision, configure, or scale virtual machines to run your containers; containers become the fundamental compute primitive.
This model is called serverless: you are charged only for the compute resources and storage needed to execute your code. That does not mean there are no servers involved; it just means you don't need to care about them.
To monitor that usage you can use CloudWatch. The documents below describe how this can be achieved:
How do I troubleshoot high CPU utilization on an Amazon ECS task on Fargate?
How can I monitor high memory utilization for Amazon ECS tasks on Fargate?
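If you want the raw numbers rather than a dashboard, here is a minimal sketch of pulling per-pod CPU utilization from CloudWatch with boto3. It assumes Container Insights metrics are being shipped for the cluster (on EKS Fargate this typically means running an ADOT collector, since DaemonSets are not available); the cluster, namespace, and pod names are placeholders:

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-eks-cluster"},
        {"Name": "Namespace", "Value": "default"},
        {"Name": "PodName", "Value": "my-app"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```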
It is worth mentioning that Fargate is just a launch type for ECS (the other one is EC2). Please have a look at the diagram in this document for a clear picture of how these pieces are connected. The CloudWatch metrics are collected automatically for Fargate. If you are using EKS with Fargate, you can monitor the pods with a metrics add-on or Prometheus inside your Kubernetes cluster.
Here's an example of monitoring Fargate with Prometheus. Notice that it scrapes the metrics from CloudWatch.

AWS ECS cluster auto-scaling vs service auto-scaling

This is my first time using the Amazon ECS service.
I have searched online for a while to understand auto-scaling with ECS services.
I found there are two options to auto-scale my application, but there are some things I don't understand.
The first is service auto scaling, which tracks CPU/memory metrics from CloudWatch and increases the task count accordingly.
The second is cluster auto scaling, which needs an auto scaling resource, a capacity provider, and so on. But in 'Tutorial: Using cluster auto scaling', it runs the task definition without a service, yet it also seems to increase the task count in the end.
So what are the differences, and the pros and cons, between them?
I will try to explain briefly.
A task is a container running our code (from a Docker image).
A service makes sure that the given desired number of tasks is maintained.
We run these services in ECS backed by either EC2 or Fargate: EC2 instances are machines managed by us, while Fargate machines are managed by AWS.
Scaling:
Ultimately, we scale the tasks by setting a desired number of tasks between a min and a max, based on CPU or any other metric of the individual tasks. This is called service auto scaling.
Fargate: since AWS manages the necessary VMs behind the scenes, we can set any number of desired tasks we want and scale seamlessly without worrying about the infrastructure.
EC2: we can't scale services as seamlessly, because EC2 instances have to be added/removed behind the scenes too. We need to auto scale those instances as well, based on CPU or other metrics of the EC2 machines, which is called cluster auto scaling.
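For the service auto scaling side, here is a minimal sketch using boto3 and Application Auto Scaling, with target tracking on average service CPU (the cluster/service names, capacity bounds, and target value are placeholders):

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "service/my-cluster/my-service"  # hypothetical

# Make the service's desired count scalable between 2 and 10 tasks.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Keep the service's average CPU utilization near 75%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 75.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```

With target tracking, Application Auto Scaling creates and manages the CloudWatch alarms for you.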

How to shut down the EC2 instances backing ECS to save cost for staging/QA

We have a Docker container hosted on AWS ECS with EC2 instances, and we would like to terminate/shut down these EC2 instances at night and on weekends for staging/QA to save cost.
Thanks in advance :)
The AWS Instance Scheduler is a simple AWS-provided solution that enables customers to easily configure custom start and stop schedules for their Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Relational Database Service (Amazon RDS) instances. The solution is easy to deploy and can help reduce operational costs for both development and production environments.
https://aws.amazon.com/solutions/implementations/instance-scheduler/
If you run the instances in an Auto Scaling group (ASG), you could use a scheduled policy to set the desired capacity of the ASG to zero for the off-peak times. A second policy would bring it back up for working hours.
An alternative would be to set up a CloudWatch Events scheduled rule, using cron, with a Lambda function as the target. The function would do the same as the scaling policy, but because it is a Lambda function you could also do other things there, for example some pre-shutdown checks or post-shutdown cleanup.
This works because, if your tasks run in a service, ECS will automatically relaunch them when the instances come back.
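A minimal sketch of the scheduled-policy approach with boto3; the ASG name, sizes, and cron times (UTC) are placeholders:

```python
import boto3

asg = boto3.client("autoscaling")
ASG_NAME = "staging-ecs-asg"  # hypothetical

# Scale to zero at 19:00 UTC on weekdays.
asg.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="stop-staging-evenings",
    Recurrence="0 19 * * MON-FRI",
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)

# Bring the group back at 07:00 UTC on weekdays.
asg.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="start-staging-mornings",
    Recurrence="0 7 * * MON-FRI",
    MinSize=2, MaxSize=4, DesiredCapacity=2,
)
```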
You could also manage the number of tasks using the scheduling capability of Amazon ECS.

Autoscaling AWS nodes in a kubernetes cluster

I've found a few posts on similar topics, but wanted to clarify:
If I am running Kubernetes in AWS (natively, e.g. deployed with kops), is there any mechanism that can add nodes to the AWS node ASG to meet resource requirements?
For example, if I deploy a 2-worker-node cluster (ASG) with a total of 8 GB of memory, and I create a few Kubernetes deployments on the cluster whose memory requirements exceed 8 GB, is there a mechanism that will transparently scale the underlying ASG to provide the required resources, without me needing to manually increase the size of the ASG?
Thanks in advance.
Have you tried the Kubernetes Cluster Autoscaler project?
It is AWS-compatible, so it should meet your requirements.
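If you go this route, note that the Cluster Autoscaler's AWS auto-discovery mode finds node groups by specific ASG tags. A minimal sketch for checking that your groups carry them, using boto3 (the cluster name is a placeholder):

```python
import boto3

asg = boto3.client("autoscaling")
CLUSTER_NAME = "my-kops-cluster"  # hypothetical cluster name

# Tags the autoscaler's --node-group-auto-discovery mode matches on.
required = {
    "k8s.io/cluster-autoscaler/enabled",
    f"k8s.io/cluster-autoscaler/{CLUSTER_NAME}",
}

for group in asg.describe_auto_scaling_groups()["AutoScalingGroups"]:
    tags = {t["Key"] for t in group["Tags"]}
    if required <= tags:
        print("autoscaler-managed:", group["AutoScalingGroupName"])
```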