AWS auto scaling single node groups? - aws-auto-scaling

Ok, I have a strange situation. Can't find anything quite like it online.
Within a platform that I'm helping run, we have a couple of services that really can only run on a single node. Yes, our developers are working on fixing this, but in the meantime... We are currently using HA to handle failover to a hot standby, but we would like to try AWS Auto Scaling Groups for consistency with the rest of our architecture.
We've tried setting min/max/desired to 1/1/1, with some success. However, we've hit an issue where it takes about 3 minutes for the ASG to spin down a failed EC2 instance and spin up a replacement. During this time, havoc ensues within the platform.
My question is this: is there a way to make the ASG start the new EC2 instance before stopping the unhealthy one?

From my current knowledge, the answer is no.
The ASG schedules the replacement instance almost immediately. Unfortunately, the new instance needs some time to warm up and pass the health check.
https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
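With max size 1 you can't overlap the old and new instances, but you can narrow the gap by trimming the health check grace period to match how long the service actually takes to warm up. A minimal boto3 sketch, assuming a hypothetical group name `single-node-asg` and an ELB health check:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the group pinned to exactly one instance (min/max/desired = 1/1/1)
# and shorten the grace period so a replacement is marked healthy sooner.
# "single-node-asg" and the 60-second grace period are assumptions; tune
# the grace period to however long your service really needs to warm up.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="single-node-asg",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    HealthCheckType="ELB",
    HealthCheckGracePeriod=60,
)
```

This doesn't start the new instance before the old one is terminated; it only shortens the window during which the group is empty.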

Related

AWS ECS app fast auto scaling for video encoding. What is the best way?

I am currently running a video encoding application on ECS but auto scaling is my biggest problem.
Users start live video encoding jobs from a front end. Once a job is placed, it is added as a Redis Queue (RQ) job that runs as an ECS task placed on a c5d.large instance using ffmpeg.
Autoscaling is currently based on alarms. If CPU usage is above a set percentage, a new instance and task are spawned. If CPU is low, instances are checked and, if no jobs are running, they are destroyed.
This is not a bad solution, but it feels clunky and slow. If a user wants to start two jobs one right after the other, it takes a couple of minutes for the instance to spawn and the task to be placed (even using warm pools).
Plus, CloudWatch alarms take a while to refresh and are not a very reliable way of measuring the work being done (a video encoding at 720p will use less CPU than one at 1080p and thus throw off all my alarm settings).
Is there a better solution someone can point me to that allows for fast and precise autoscaling, other than relying on CloudWatch alarms? I am tempted to build my own autoscaling system based on the currently executing jobs/workers and spawn/destroy instances by calling the API directly from my code, but I'm hoping to find a better solution directly within AWS.
Thanks
I too have this exact problem. AWS already has MediaConvert/Elastic Transcoder, but it's just too expensive, so I decided to build my own. I started on Lambda with SST.dev (serverless), where each job is a single function invocation, but I had issues with the 15-minute function timeout, mostly because I'm not copying codecs.
For scaling at this point, I would look at Kubernetes. This is the sort of problem Kubernetes is intended to handle (dynamic resource scaling on demand). Kubernetes is rather non-trivial, but K8s is what the industry has settled on for the most part, so there are probably a lot of reasons to just go that route. You could start with K3s (which I only just learned about today) and move up to full K8s when you are ready.
Since you're trying to find a solution directly within AWS, you could try EKS, but I'm not completely sure what the best option would be.
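If you do end up building your own scaler as described in the question, a rough sketch is a small control loop that reads the RQ queue depth and sets the ASG's desired capacity directly. The names here (queue `encodes`, group `encoder-asg`, one job per instance) are assumptions for illustration, not a drop-in solution:

```python
import time

import boto3
from redis import Redis
from rq import Queue

# Hypothetical names -- replace with your own queue and Auto Scaling group.
QUEUE_NAME = "encodes"
ASG_NAME = "encoder-asg"
MAX_INSTANCES = 10

autoscaling = boto3.client("autoscaling")
queue = Queue(QUEUE_NAME, connection=Redis())

while True:
    # Jobs waiting in the queue plus jobs currently executing.
    pending = queue.count
    running = queue.started_job_registry.count
    # Assume one encoding job per instance; clamp to an upper bound.
    desired = min(pending + running, MAX_INSTANCES)

    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    time.sleep(30)
```

Scaling down this way can still terminate an instance mid-encode, so you'd want to pair it with scale-in protection on busy workers (see the SQS question further down).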

AWS Autoscaling Group EC2 instances go down during cron jobs

I tried autoscaling groups and, alternatively, just a bunch of EC2 instances tied together by a load balancer. Both configs work fine at first glance.
But when an EC2 instance is part of an autoscaling group, it sometimes goes down. Actually, it happens very often, almost once a day, and the instances go down in a "hard reset" way. The EC2 monitoring graphs show that CPU usage goes up to 100%, then the instance becomes unresponsive, and then it is terminated by the autoscaling group.
And it has nothing to do with my processes on these instances.
When the instance is not part of an autoscaling group, it can run for years without the CPU usage spikes.
The "hard resets" on autoscaling group instances are breaking my cron jobs. As much as I like autoscaling groups, I cannot use them.
Is there a standard way to deal with the "hard resets"?
PS.
The cron jobs are running PHP scripts on Ubuntu in my case. I managed to make only one instance run the job.
It sounds like you have a health check that is failing while your cron is running, and as a result the instance is being taken out of service.
If you look at the ASG, there should be a reason listed for why the instance was taken out. This will usually be a health check failure, but there could be other reasons as well.
There are a couple things you can do to fix this.
First, determine why your cron is taking 100% of CPU, and how long it generally takes.
Review your health check settings. Are you using HTTP or TCP? What is the interval, and how many checks have to fail before it is taken out of service?
Between those two items, you should be able to adjust the health checks so that the instance isn't taken out of service while the cron is running. It is also possible that the instance really is failing; typically this would be because it runs out of memory. If that is the case, you may want to consider moving to a larger instance type and/or enabling swap.
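To see the reason the ASG recorded for taking an instance out of service, you can pull the group's scaling activity history. A small boto3 sketch, with `web-asg` as a placeholder group name:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# "web-asg" is a placeholder -- use your own group's name.
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="web-asg",
    MaxRecords=20,
)

for activity in activities["Activities"]:
    # "Cause" usually spells out whether the instance failed an EC2 or ELB
    # health check, or was replaced for some other reason.
    print(activity["StartTime"], activity["StatusCode"])
    print("  ", activity["Cause"])
```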
I once had a similar issue; in that case it was the system auto-update running. The system (Windows Server) was downloading a big update and took 100% of the CPU for hours. My suggestion is to monitor which service is running at that moment (even if the OS is Linux), and also check for any scheduled tasks (since it looks like this happens periodically). Other than that, try to keep the task list open during the event and see what is going on.

AWS Binpack placement strategy resulting in issues during instance autoscaling

Here is the scenario:
We are running Docker containers in an AWS ECS cluster. Previously, we were not using any placement strategy for containers. To minimize the number of instances within the cluster, we tried introducing the binpack placement strategy. After that, whenever we try to deploy multiple containers at a time (in parallel), the instances do not autoscale and stay at the minimum limit set for them. We are not sure what went wrong. Most of the services do not reach a steady state because of this. For now, we have removed binpack, and everything works perfectly again and we are able to deploy in parallel.
There is no issue when we deploy one service at a time, though; everything seems fine.
We are using t2.large type instances in our case.
Instance auto-scaling is happening based on memory reservation (>80% for 1 minute).
Looking at the graph, we can see the memory reservation threshold is not being reached. It crosses the 80% threshold only for a few seconds and then drops back down. This is strange behaviour, in my opinion.
Does binpack not support t2 instance types? Or is there some other case I am missing?
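For reference, this is roughly what the binpack configuration looks like when creating a service with boto3; the cluster, service, and task definition names here are placeholders, not the ones from the question:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names for illustration only.
ecs.create_service(
    cluster="my-cluster",
    serviceName="my-service",
    taskDefinition="my-task:1",
    desiredCount=4,
    # binpack on memory packs tasks onto as few instances as possible,
    # minimizing the number of container instances in use.
    placementStrategy=[
        {"type": "binpack", "field": "memory"},
    ],
)
```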

AWS Instance Scheduler with auto scaling groups

I've configured AWS instance scheduler, everything is working as expected.
The issue I'm having is that each instance has an autoscaling group in my dev environment, and I'm unable to shut down instances without them being terminated by the autoscaling group when it does a health check and notices they're down.
Has anyone figured out an automated solution to this that doesn't require manually suspending the ASG? Since the whole purpose of this is to stop the instances after hours, I can't be around to suspend/resume the ASG myself.
Thanks in advance!
"Auto Scaling" and "AWS Instance Scheduler" don't really fit together nicely. Do you really need ELB for Dev environments? I feel this is overkill.
Anyway, if you still want to use ELB + Auto Scaling and would like to shut down the boxes during off hours, you can set the Auto Scaling group to zero for the hours you want using the Scheduled Scaling approach.
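A rough boto3 sketch of that scheduled-scaling approach, with the group name and schedule as placeholders (recurrence times are UTC):

```python
import boto3

autoscaling = boto3.client("autoscaling")
ASG_NAME = "dev-asg"  # placeholder

# Scale the group to zero every weekday evening...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="stop-after-hours",
    Recurrence="0 19 * * 1-5",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)

# ...and bring it back up in the morning.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="start-work-hours",
    Recurrence="0 7 * * 1-5",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)
```

Note that scaling to zero terminates the instances rather than stopping them, so any per-instance state needs to come from the AMI or somewhere else when they come back.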

AWS SQS + autoscale

Assuming that I have a queue and multiple instances in an autoscaling group.
For the scale-up case, it's quite easy: if the length of the queue grows, the autoscaling group spawns new instances.
For the scale-down case, it's a bit trickier. If the length of the queue shrinks, the autoscaling group terminates instances. That sounds obvious, but the question is: what happens to instances that are still processing messages when they get terminated?
Of course we could use metrics like CPU utilisation, disk read/write, etc. to check, but I don't think that's a good idea. I'm thinking about a central place where instances register whether they are processing or not, so that the idle ones can be identified and terminated properly.
Any thoughts for this? Thanks.
The accepted answer on this thread:
Amazon Auto Scaling API for Job Servers
gives you two possibilities for handling your situation. One of them should work for you. Also keep in mind that you don't necessarily want to kill an instance as soon as there is no work: when instances spin up, you pay for the whole hour whether you use 59 minutes or 1 minute, so you may want to build that into your solution. Spin up instances fast, turn them off slowly.
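One pattern that addresses the "don't terminate a busy worker" part (my own illustration, not necessarily one of the two approaches from the linked thread) is to have each worker set scale-in protection on its own instance while it is handling a message. A hedged boto3 sketch, with the group name, queue URL, and job handler as placeholders:

```python
import urllib.request

import boto3

ASG_NAME = "worker-asg"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

autoscaling = boto3.client("autoscaling")
sqs = boto3.client("sqs")

# Assumes IMDSv1 is enabled; with IMDSv2 you would fetch a token first.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

def set_protection(protected: bool) -> None:
    # While protected, the ASG will not choose this instance during scale-in.
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=protected,
    )

def process(message: dict) -> None:
    # Placeholder for the real work done per message.
    print("processing", message["MessageId"])

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in resp.get("Messages", []):
        set_protection(True)
        try:
            process(message)
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
            )
        finally:
            set_protection(False)
```

With this in place, scale-in driven by queue length only removes instances that are idle at that moment.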