ECS Capacity Provider not scaling in

According to the docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-auto-scaling.html#update-ecs-resources-cas
Looking at the services, I have 3 services: the 2 replica services are both running on a single host, and the daemon service runs on both hosts. So per the docs, scale-in protection should have been removed from the second instance, and the error shown in the second image below shouldn't have occurred.
I am facing the exact same issue as this SO post, but that is still unanswered.
Current Capacity Provider state:
Error in ASG:
Service definitions:
2 replicas + 1 DAEMON on the first instance (the first three tasks)
The second instance runs only 1 DAEMON, so it should have been removed by scale-in, but it wasn't.
Capacity Provider configuration:

It turns out this was because the AmazonECSManaged tag was missing on the ASG. After adding AmazonECSManaged=true, the cluster was able to scale in.
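For reference, a minimal sketch of adding that tag with the AWS CLI (the ASG name below is a placeholder for the capacity provider's Auto Scaling group):
# Hypothetical ASG name; tag it so ECS cluster auto scaling can manage it.
aws autoscaling create-or-update-tags --tags \
  ResourceId=my-ecs-asg,ResourceType=auto-scaling-group,Key=AmazonECSManaged,Value=true,PropagateAtLaunch=true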

Related

Difference between AWS ECS Service Type Daemon & Constraint "One Task Per Host"

At first look, the AWS ECS "Daemon" service type and the placement constraint "One Task Per Host" look very similar. Can someone please explain the differences between the two, with some real-life examples of when one is preferred over the other?
By "One Task Per Host" are you referring to the distinctInstance constraint?
distinctInstance means that no more than 1 instance of the task can be running on a server at a time. However, the actual count of task instances across your cluster will depend on your desired task count setting. So if you have 3 servers in your cluster, you could have as few as 1 of the tasks running, and as many as 3.
daemon specifies to ECS that one of these tasks has to be running on every server in the cluster. So if you have 3 servers in your cluster then you will have 3 instances of the task running, one on each server.
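To make the distinction concrete, here is a hedged sketch with the AWS CLI (cluster, service, and task definition names are placeholders):
# Replica service with a distinctInstance constraint: at most one copy per host,
# but the total count is still governed by --desired-count.
aws ecs create-service --cluster my-cluster --service-name my-replica-svc \
  --task-definition my-task:1 --desired-count 2 \
  --placement-constraints type=distinctInstance
# Daemon service: exactly one copy on every container instance, no desired count needed.
aws ecs create-service --cluster my-cluster --service-name my-daemon-svc \
  --task-definition my-task:1 --scheduling-strategy DAEMON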

Jenkins AWS Spot fleet plugin doesn't automatically scale spot instances

We planned to use EC2 Spot instances/fleet as our Jenkins slave solution, based on this article: https://jenkins.io/blog/2016/06/10/save-costs-with-ec2-spot-fleet/.
EXPECTED
If the Spot instance nodes remain idle for the specified idle time (I have configured 5 minutes), Jenkins releases the nodes and my Spot Fleet is automatically scaled down.
ACTUAL
My Spot instances have been running for days. Also, I noticed that when I have more pending jobs, Jenkins does not automatically scale my Spot Fleet to add more nodes.
Is the automatic scale up/down supposed to be triggered by an AWS service, or by the Jenkins plugin?
CONFIGURATION
Jenkins version : 2.121.2-1.1
EC2 Fleet Jenkins Plugin version : 1.1.7
Spot instance configuration :
Request type : request & maintain
Target Capacity : 1
Spot fleet plugin configuration :
Max Idle Minutes Before Scaledown : 5
Minimum Cluster Size : 0
Maximum Cluster Size : 3
Any help or lead would be really appreciated.
I had the same issue, and by looking in Jenkins' logs I saw that it tried to terminate the instances but was refused by AWS.
So I checked in AWS CloudTrail for all the actions Jenkins tried that returned an error.
In order for the plugin to scale your Spot Fleet, check that your AWS EC2 Spot Fleet plugin has the following permissions with the right conditions:
ec2:TerminateInstances
ec2:ModifySpotFleetRequest
In my case, the condition in the policy was malformed and didn't work.
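As a hedged sketch (the IAM user and policy names are placeholders, and your policy may also scope these actions with conditions), the permissions could be granted like this:
# Hypothetical IAM user and policy names; tighten Resource/Condition to your setup.
aws iam put-user-policy --user-name jenkins-fleet --policy-name jenkins-spot-fleet-scaling \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["ec2:TerminateInstances", "ec2:ModifySpotFleetRequest"],
      "Resource": "*"
    }]
  }'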

CloudFormation, CodeDeploy, ELB & Auto-Scaling Group

I am trying to build a stack with an ELB, an Auto-Scaling Group and a Pipeline (with CodeBuild and CodeDeploy).
I can't understand how it is supposed to work:
the auto-scaling group starts two instances and waits X minutes before starting to check the instances' state
the CodeDeploy application deployment group is waiting for the Auto-Scaling group to be created and ready
the pipeline takes about 10 minutes to start deploying the application
My issue is that when I create the stack, it looks like there is a loop: the AG requires an application from CodeDeploy, and CodeDeploy requires a stabilized AG. To be clear, by the time the application is ready to deploy, my Auto-Scaling group is already terminating instances and starting new ones, so CodeDeploy is trying to deploy to instances that are already terminated or terminating.
I don't really want to configure HealthCheckGracePeriod and PauseTime to be ~10-15 minutes... it is way too long.
Are there any best practices for CloudFormation + ELB + AG + CodeDeploy via a Pipeline?
What should be the steps to achieve that?
Thank you!
This stopping/starting of the instances is most probably linked to the Deployment Type: in-place vs. blue/green.
I have tried both in my setup, and I will try to summarize how they work.
Let's say that for this example you have an Autoscaling group which, at the time of deploying the application, has 2 running instances, and the deployment configuration is OneAtATime. Traffic is controlled by the Elastic Load Balancer. Then:
In-place deployment:
CodeDeploy gets notified of a new revision available.
It tells the ELB to stop directing traffic to the 1st instance.
Once traffic to that instance is stopped, it starts the deployment process: stop the application, download the bundle, etc.
If the deployment is successful (validate service hook returned 0), it tells ELB to resume traffic to that instance.
At this point, 1 instance is running the old code and 1 is running the new code.
Right after that, it tells the ELB to stop traffic to the 2nd instance and repeats the deployment process there.
Important note:
With the ELB enabled, the time it takes to block traffic to an instance before deployment, and the time it takes to allow traffic after it, depend directly on your health check: time = healthy threshold * interval (for example, a healthy threshold of 5 with a 30-second interval means roughly 150 seconds each way).
Blue/green deployment:
CodeDeploy gets notified of a new revision available.
It copies your Autoscaling Group: the same configuration of the group (including scaling policies, scheduled actions, etc.) and the same number of instances (using the same AMI as your original AG) as there were at the start of the deployment - in our case 2.
At this point, there is no traffic going to the new AG.
CodeDeploy performs all the usual installation steps to one machine.
If successful, it deploys to the second machine too.
It directs traffic from the instances in your old AG to the new AG.
Once traffic is completely re-routed, it deletes the old AG and terminates all its instances (after a period specified in Deployment Settings - this option is only available if you select Blue/Green)
Now ELB is serving only the new AG.
From experience:
Blue/green deployment is a bit slower, since you need to wait for the machines to boot up, but you get a much safer and more fail-proof deployment.
In general I would stick with blue/green, with the load balancer enabled and the Deployment Configuration set to AllAtOnce (if it fails, customers won't be affected since the instances won't be receiving traffic, and it will be 2x as fast since you deploy in parallel rather than sequentially).
If your health checks and validate service hook are thorough enough, you can probably delete the original AG with minimal waiting time (5 minutes at the time of writing this post).
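This behaviour can also be configured outside CloudFormation; here is a hedged sketch with the AWS CLI (application, deployment group, role, ASG, and load balancer names are placeholders, and the 5-minute termination wait is just an example):
# Hypothetical names/ARNs throughout; adjust to your stack.
aws deploy create-deployment-group \
  --application-name my-app \
  --deployment-group-name my-app-bluegreen \
  --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
  --auto-scaling-groups my-asg \
  --deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
  --load-balancer-info "elbInfoList=[{name=my-elb}]" \
  --blue-green-deployment-configuration '{
    "terminateBlueInstancesOnDeploymentSuccess": {"action": "TERMINATE", "terminationWaitTimeInMinutes": 5},
    "deploymentReadyOption": {"actionOnTimeout": "CONTINUE_DEPLOYMENT"},
    "greenFleetProvisioningOption": {"action": "COPY_AUTO_SCALING_GROUP"}
  }'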

Aws auto scaling group failure for elasticbeanstalk

So my Elastic Beanstalk environment has been deployed successfully. However, auto scaling seems to be an issue.
ASG config:
Desired = 2
Min = 2
Max = 4
I have enough instances available for the instance type I am using.
When I test my app, just to simulate load, the instance count increases to 3, but it fails when it tries to bring up another (4th) instance, with the errors below:
Description: Launching a new EC2 instance.
Status Reason: Your quota allows for 0 more running instance(s). You requested at least 1. Launching EC2 instance failed.
Reason: an instance was started in response to a difference between desired and actual capacity, increasing the capacity from
Am I missing something somewhere?
Links would be quite helpful if someone has seen this before.
Looking at the error:
Status Reason: Your quota allows for 0 more running instance(s). You requested at least 1.
The error message shows that you have hit your EC2 instance limit. Please check how many instances you are running.
The specifics on the limits can be found in the FAQ here.
You can request a limit increase as explained here.
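A hedged sketch for checking the current limit and requesting an increase with the AWS CLI (the quota code below is assumed to be the "Running On-Demand Standard instances" quota, which is measured in vCPUs; verify it with list-service-quotas for your account and region):
# Assumed quota code for Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances.
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A
# Request an increase (the value is in vCPUs for this quota).
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 64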
Thanks

Why does every command Elastic Beanstalk issues to its instances time out?

I have a PHP application deployed to Amazon Elastic Beanstalk. But I've noticed a problem: every time I push my code changes via git aws.push to Elastic Beanstalk, the deployed application doesn't pick up the changes. I checked the event log on my application's Beanstalk environment and noticed that every time Beanstalk issues:
Deploying new version to instance(s)
it's always followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own):
[i-d5xxxxx]
The same thing happens when I try to request snapshot logs. Beanstalk issues:
requestEnvironmentInfo is starting
then after a few minutes it's again followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-d5xxxxx].
I had this problem a few times. It seems to affect only particular instances. So it can be solved by terminating the EC2 instance (done via the EC2 page on the Management Console). Thereafter, Elastic Beanstalk will detect that there are 0 healthy instances and automatically launch a new one.
If this is a production environment, you have only 1 instance, and you want minimal downtime:
1. Configure the minimum number of instances to 2 (as sketched below); Beanstalk will launch another instance for you.
2. Terminate the problematic instance via the EC2 tab; Beanstalk will launch another instance for you because the minimum is 2.
3. Configure the minimum back to 1; Beanstalk will remove one of your two instances.
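A hedged sketch of steps 1 and 3 with the AWS CLI (the environment name is a placeholder):
# Temporarily raise the minimum instance count to 2 (hypothetical environment name).
aws elasticbeanstalk update-environment --environment-name my-env \
  --option-settings Namespace=aws:autoscaling:asg,OptionName=MinSize,Value=2
# ...terminate the problematic instance from the EC2 console, then scale back down.
aws elasticbeanstalk update-environment --environment-name my-env \
  --option-settings Namespace=aws:autoscaling:asg,OptionName=MinSize,Value=1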
By default, Elastic Beanstalk "throws a timeout exception" after 8 minutes (480 seconds, defined in settings) if your commands did not complete in time.
You can set a higher timeout, up to 30 minutes (1800 seconds).
{
  "Namespace": "aws:elasticbeanstalk:command",
  "OptionName": "Timeout",
  "Value": "1800"
}
Read here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html
I had the same issue here (single t1.micro instance).
I solved the problem by rebooting the EC2 instance via the EC2 page on the Management Console (and not from the EB page).
Beanstalk deployment (and other features like Get Logs) works by sending SQS commands to the instances. An SQS client is deployed to the instances and checks the queue about every 20 seconds (see /var/log/cfn-hup.log):
2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://sqs.us-east-2.amazonaws.com/124386531466/93b60687a33e19...
If the SQS client crashes or has network problems on t1/t2 instances, it will not be able to receive commands from Beanstalk, and the deployment will time out. Rebooting the instance restarts the SQS client, and it can receive commands again.
An easier way to fix SQS Client is to restart cfn-hup service:
sudo service cfn-hup restart
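If it helps, one hedged way to confirm the client is polling the queue again is to watch the log mentioned above:
# The "Receiving messages for queue ..." lines should reappear roughly every 20 seconds.
sudo tail -f /var/log/cfn-hup.log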
In the case of deployment, an alternative to shutting down the EC2 instances and waiting for Elastic Beanstalk to react, or messing about with minimum and maximum instances, is to simply perform a Rebuild environment on the target environment.
If a previous deployment failed due to timeout then the new version will still be registered against the environment, but due to the timeout it will not appear to be operational (in my experience the instance appears to still be running the old version).
Rebuilding the environment seems to reset things with the new version being used.
Obviously the downside of that is a period of downtime.
I think the correct way to deal with this is to figure out the cause of the timeout by doing what this answer suggests.
chongzixin's answer is what needs to be done if you need this fixed ASAP before investigating the reason for a timeout.
However, if you do need to increase timeout, see the following:
Add configuration files to your source code in a folder named .ebextensions and deploy it in your application source bundle.
Example:
option_settings:
  "aws:elasticbeanstalk:command":
    Timeout: 2400
*"value" represents the length of time before timeout in seconds.
Reference: https://serverfault.com/a/747800/496353
"Restart App Server(s)" from the "Actions" menu in Elastic Beanstalk management dashboard followed by eb deploy fixes it for me.
Visual cue for the first instruction
After two days of checking random issues, I restarted both EC2 instances one after another to make sure there was no downtime. The site worked fine, but after a while the website started throwing 504 errors.
When I checked the HTTP server, nginx was down with an "out of HDD space" error. I increased the disk size, Elastic Beanstalk created new instances, and the issue was fixed.
For me, the problem was my VPC security group rules. According to the docs, you need to allow outbound traffic on port 123 for NTP to work. I had the port closed, so the clock was drifting, and so the EC2 instances were becoming unresponsive to commands from the Elastic Beanstalk environment, taking forever to deploy (only to time out), failing to get logs, etc.
Thank you @Logan Pickup for the hint in your comment.
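For reference, a hedged sketch of opening that egress rule with the AWS CLI (the security group ID is a placeholder; NTP uses UDP port 123):
# Hypothetical security group ID; allow outbound NTP so the instance clock can stay in sync.
aws ec2 authorize-security-group-egress --group-id sg-0123456789abcdef0 \
  --protocol udp --port 123 --cidr 0.0.0.0/0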